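Overview
What is a DGA domain? A domain generation algorithm (DGA) produces large numbers of pseudo-random domain names, and it is a key weapon that centrally structured botnets rely on to survive. Botnets are a deep topic in their own right; I will cover them later in the 信息安全入门笔记 (Introduction to Information Security Notes) column.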
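To make the idea concrete, here is a minimal toy sketch of how a DGA can derive domains from a shared seed and the current date. It is purely illustrative: the function name `toy_dga`, the seed string, and the hashing scheme are assumptions made for this example and do not reproduce the algorithm of Cryptolocker or any other real malware family.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, length=12, tld=".com"):
    """Derive `count` pseudo-random domains from a seed and a date (toy example)."""
    domains = []
    for i in range(count):
        material = f"{seed}-{day.isoformat()}-{i}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0..15) onto a lowercase letter (a..p)
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(label + tld)
    return domains


# Hypothetical seed; both sides of the botnet would share it
domains = toy_dga("example-seed", date(2017, 4, 13))
for d in domains:
    print(d)
```

Because the output depends only on the seed and the date, the bot and its controller can compute the same list independently, which is why blocking a single domain is not enough.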
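1. Data collection and cleaning
The benign (white) samples come from the Alexa top 1000 domains and are labeled 0. Because the minimum domain length is set to 10, only 679 of them are kept.
The malicious (black) samples are dga-cryptolocker (1000 samples) and dga-tovar-goz (1000 samples), labeled 1 and 2 respectively, for a total of 2679 samples.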
x1_domain_list = load_alexa("../data/top-1000.csv")
x2_domain_list = load_dga("../data/dga-cryptolocke-1000.txt")
x3_domain_list = load_dga("../data/dga-post-tovar-goz-1000.txt")
x_domain_list=np.concatenate((x1_domain_list, x2_domain_list,x3_domain_list))
y1=[0]*len(x1_domain_list)
y2=[1]*len(x2_domain_list)
y3=[2]*len(x3_domain_list)
y=np.concatenate((y1, y2,y3))
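The loading and labeling step can be tried without the data files. The sketch below uses in-memory stand-ins: the Alexa rows are invented for illustration (assuming a `rank,domain` layout, which is what `row[1]` implies), and the DGA line is the sample quoted in the comment of `load_dga` in the complete code.

```python
import csv
import io

MIN_LEN = 10  # same threshold as in the article

# Hypothetical stand-ins for ../data/top-1000.csv and the DGA feed
alexa_csv = io.StringIO("1,google.com\n2,qq.com\n3,stackoverflow.com\n")
dga_txt = io.StringIO(
    "xsxqeadsbgvpdke.co.uk,Domain used by Cryptolocker - Flashback DGA "
    "for 13 Apr 2017,2017-04-13,\n"
)

# Benign domains: second CSV column, kept only if long enough
benign = [row[1] for row in csv.reader(alexa_csv) if len(row[1]) >= MIN_LEN]

# Malicious domains: first comma-separated field of each line
malicious = []
for line in dga_txt:
    domain = line.split(",")[0]
    if len(domain) >= MIN_LEN:
        malicious.append(domain)

domains = benign + malicious
labels = [0] * len(benign) + [1] * len(malicious)
print(list(zip(domains, labels)))
```

The short domain `qq.com` is dropped by the length filter, which is the same effect that reduces the 1000 Alexa domains to 679 samples.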
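2. Feature extraction
The domains are processed with a 2-gram model whose unit is the single character. The 2-grams found across the entire dataset form the vocabulary, and each domain is mapped onto that vocabulary to obtain its feature vector: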
cv = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                     token_pattern=r"\w", min_df=1)
x= cv.fit_transform(x_domain_list).toarray()
The vector of feature names looks like this:
[u'0 0', u'0 1', u'0 2', u'0 3', u'0 4', u'0 5', u'0 6', u'0 7', u'0 8', u'0 9',u'z u', u'z v', u'z w', u'z x', u'z y', u'z z']
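The following standard-library sketch mimics that tokenization on two tiny inputs, assuming the token pattern matches single word characters (`\w`) and that input is lowercased, as `CountVectorizer` does by default. The helper names `char_bigrams` and `vectorize` are made up for this example; it is a simplified imitation, not scikit-learn's implementation.

```python
import re
from collections import Counter


def char_bigrams(domain):
    """Split into single word characters, then join neighbours as 'a b'."""
    tokens = re.findall(r"\w", domain.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def vectorize(domains):
    """Build a sorted vocabulary and one count vector per domain."""
    counts = [Counter(char_bigrams(d)) for d in domains]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = vectorize(["ab.com", "abab.com"])
print(vocabulary)  # ['a b', 'b a', 'b c', 'c o', 'o m']
print(matrix)      # [[1, 0, 1, 1, 1], [2, 1, 1, 1, 1]]
```

Note that the dot is not a word character, so it is discarded and the 2-gram `b c` spans the boundary between the label and the TLD.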
3. Training the model
Instantiate the naive Bayes (NB) classifier:
clf = GaussianNB()
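For readers who want to see what a Gaussian naive Bayes classifier computes, here is a small pure-Python sketch. It estimates a prior, a mean, and a variance per class and feature, then picks the class with the highest log posterior. The function names and the `eps` smoothing constant are assumptions of this example; scikit-learn's `GaussianNB` handles variance smoothing differently, so treat this as an illustration of the idea only.

```python
import math
from collections import defaultdict


def fit(X, y, eps=1e-9):
    """Estimate per-class prior, feature means and feature variances."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps avoids division by zero for constant features (assumption)
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, row):
    """Return the class with the largest log prior plus log likelihood."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Tiny made-up dataset with two features and two classes
X = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0], [0.9, 0.1]]
y = [0, 0, 1, 1]
model = fit(X, y)
print(predict(model, [0.1, 0.9]), predict(model, [0.95, 0.05]))
```

The "naive" part is the sum over features inside `log_posterior`: each 2-gram count is treated as independent of the others given the class.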
4. Validation
2679 samples is not a large dataset, so 10-fold cross-validation is used:
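# from sklearn.model_selection import cross_val_score
# (the old sklearn.cross_validation module has been removed from scikit-learn)
score = cross_val_score(clf, x, y, n_jobs=-1, cv=10)
print(score)
print(np.mean(score))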
The mean accuracy is roughly 94.7%:
[0.98134328 0.94029851 0.94402985 0.93656716 0.94402985 0.93283582
0.94029851 0.93656716 0.94776119 0.96629213]
0.9470023478115044
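With an integer `cv` and a classifier, scikit-learn's `cross_val_score` splits the data into stratified folds, so each fold keeps roughly the same class proportions. The sketch below shows one simple way to build such folds for the label counts used here. The round-robin assignment and the name `stratified_folds` are assumptions of this example and are not how scikit-learn assigns samples internally.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, class by class, in round-robin order."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


# Same class sizes as the dataset in this article
labels = [0] * 679 + [1] * 1000 + [2] * 1000
folds = stratified_folds(labels, 10)
sizes = [len(f) for f in folds]
print(sizes)
```

Each fold serves once as the test set while the other nine are used for training, which yields the ten scores listed above.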
5. Complete code
import csv

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Minimum length of a domain to be processed
MIN_LEN = 10


def load_alexa(filename):
    """Load benign domains from the Alexa CSV (domain in the second column)."""
    domain_list = []
    with open(filename) as f:
        for row in csv.reader(f):
            domain = row[1]
            if len(domain) >= MIN_LEN:
                domain_list.append(domain)
    return domain_list


def load_dga(filename):
    """Load DGA domains (first comma-separated field of each line)."""
    domain_list = []
    # xsxqeadsbgvpdke.co.uk,Domain used by Cryptolocker - Flashback DGA for 13 Apr 2017,2017-04-13,
    # http://osint.bambenekconsulting.com/manual/cl.txt
    with open(filename) as f:
        for line in f:
            domain = line.split(",")[0]
            if len(domain) >= MIN_LEN:
                domain_list.append(domain)
    return domain_list


def nb_dga():
    x1_domain_list = load_alexa("../data/top-1000.csv")
    x2_domain_list = load_dga("../data/dga-cryptolocke-1000.txt")
    x3_domain_list = load_dga("../data/dga-post-tovar-goz-1000.txt")
    x_domain_list = np.concatenate((x1_domain_list, x2_domain_list, x3_domain_list))

    # Labels: 0 = Alexa, 1 = Cryptolocker, 2 = post-Tovar GOZ
    y1 = [0] * len(x1_domain_list)
    y2 = [1] * len(x2_domain_list)
    y3 = [2] * len(x3_domain_list)
    print(len(y1), len(y2), len(y3))
    y = np.concatenate((y1, y2, y3))
    print(len(y))

    # Character-level 2-grams: every token is a single word character
    cv = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                         token_pattern=r"\w", min_df=1)
    x = cv.fit_transform(x_domain_list).toarray()
    print(cv.get_feature_names_out())
    print(len(x))

    clf = GaussianNB()
    # 10-fold cross-validation using all available CPU cores
    score = cross_val_score(clf, x, y, n_jobs=-1, cv=10)
    print(score)
    print(np.mean(score))


if __name__ == "__main__":
    nb_dga()