Research on Malicious Website Detection Technology Based on Data Mining

Published: 2019-06-25 10:03

【Abstract】: As the Internet has developed, network security has drawn growing attention. Frequent attacks by malicious websites cause users heavy financial losses and seriously threaten the security of individuals and even of nations. Building a model to identify and detect malicious websites is therefore of great importance. Many researchers in China and abroad have improved feature selection methods, mostly by mining host features and lexical features more deeply, yet accuracy and efficiency remain unsatisfactory. To address these problems, on the feature extraction side this thesis first proposes the concept of a list of vulnerable websites and, building on it, a new feature extraction scheme based on weighted distance. On the data mining side, it improves the KNN model by means of an improved fuzzy C-means (FCM) clustering algorithm, which raises the efficiency of the model. The main work of this thesis is as follows.

Data collection: Data for normal websites and for malicious websites are crawled separately, then cleaned, standardized, and stored in a MySQL database.

Feature extraction: Unlike the common concepts of a website whitelist and a website blacklist, this thesis gathers the websites that are easily attacked and proposes the concept of a list of vulnerable websites. Because a malicious website is usually a modified version of a normal website, different weights are assigned according to the type of modification, which gives the concept of weighted distance. For any input URL, the nearest weighted distance between that URL and the URLs in the list of vulnerable websites is computed and used as a new feature.

Model improvement: The thesis first improves the KNN algorithm and the fuzzy C-means algorithm. Because the initial cluster centers of FCM are uncertain and the algorithm easily falls into a local optimum, a coordinate density method is proposed to determine the initial cluster centers. Because the initial number of clusters in FCM is chosen at random, a method is proposed that determines it from the K value and the number of samples in the dataset. This yields the cluster centers of the samples and the cluster to which each center belongs. The class of a test sample is then determined by finding the cluster whose center is nearest to that sample.

Model validation: The LR model, the J48 model, and the improved KNN model are used to classify the data with WEKA. Classification accuracy on data that includes the new feature is compared with accuracy on data that uses only the original features, and the results improve to some extent. A comparison with methods from other literature also shows that the new feature performs well.
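The weighted-distance feature described in the abstract can be illustrated with a weighted edit distance. The abstract does not give the actual modification types or weights, so everything in the sketch below is an assumption made for illustration: the cost constants, the look-alike character pairs, the function names, and the example hostnames are all hypothetical and are not taken from the thesis.

```python
# Illustrative sketch only. The weights, the look-alike pairs, and the
# example hostnames are hypothetical; the thesis's real values are not
# given in the abstract.
from typing import Iterable

INSERT_COST = 1.0       # assumed weight for adding a character
DELETE_COST = 1.0       # assumed weight for removing a character
SUBSTITUTE_COST = 1.0   # assumed weight for an ordinary substitution
LOOKALIKE_COST = 0.5    # assumed weight for a visually similar swap
LOOKALIKES = {frozenset("o0"), frozenset("l1"), frozenset("i1")}


def substitution_cost(a: str, b: str) -> float:
    """Return the assumed cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in LOOKALIKES:
        return LOOKALIKE_COST
    return SUBSTITUTE_COST


def weighted_distance(source: str, target: str) -> float:
    """Weighted edit distance from source to target (dynamic programming)."""
    # previous[j] holds the cost of turning the processed prefix of
    # source into target[:j].
    previous = [j * INSERT_COST for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * DELETE_COST]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + DELETE_COST,      # drop s_char
                current[j - 1] + INSERT_COST,   # add t_char
                previous[j - 1] + substitution_cost(s_char, t_char),
            ))
        previous = current
    return previous[-1]


def nearest_weighted_distance(url: str, vulnerable_list: Iterable[str]) -> float:
    """Smallest weighted distance from url to any entry of a non-empty list."""
    return min(weighted_distance(url, known) for known in vulnerable_list)


if __name__ == "__main__":
    vulnerable = ["example.com", "example.org"]  # hypothetical list
    print(nearest_weighted_distance("examp1e.com", vulnerable))
```

Under these assumed weights, a hostname that differs from a listed site by a single look-alike character gets a small distance, and that value could then be appended to the host and lexical features as one more input to the classifier.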
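【Degree-granting institution】: Beijing University of Posts and Telecommunications (北京邮电大学)
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP393.092;TP311.13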

【References】