Research on Malicious Website Detection Technology Based on Data Mining
[Abstract]: As the Internet has developed, network security has drawn increasing attention. Frequent attacks by malicious websites cause heavy property losses for users and seriously threaten the security of individuals and even nations, so building models that identify and detect malicious websites is of great significance. Many researchers in China and abroad have improved feature selection methods, mostly by mining host features and lexical features in greater depth, but accuracy and efficiency remain low. To address these problems, this thesis proposes building a list of vulnerable websites, together with a new feature extraction scheme based on weighted distance. On the algorithm side, it improves the KNN model using an improved fuzzy C-means (FCM) clustering algorithm, which raises the model's efficiency. The work consists of the following parts. Data acquisition: data on normal and malicious websites are crawled, cleaned, standardized, and stored in a MySQL database. Feature extraction: in contrast to the familiar concepts of a website whitelist and blacklist, the thesis surveys vulnerable websites and proposes building a list of them. Malicious websites are usually modified versions of normal websites. The thesis assigns a different weight to each type of modification and defines a weighted distance on that basis; for any input URL, the smallest weighted distance to the URLs in the vulnerable-site list is computed and used as a new feature. Model improvement: both the KNN algorithm and the fuzzy C-means algorithm are improved.
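The weighted-distance feature described above can be sketched as a weighted edit distance. This is only an illustration under stated assumptions: the abstract does not give the change types or their weights, so the three costs below, the function names, and the sample URLs are all hypothetical.

```python
from typing import Iterable

# Hypothetical per-change-type weights; the abstract does not state the real values.
INSERT_COST = 1.0
DELETE_COST = 1.0
SUBSTITUTE_COST = 1.5


def weighted_distance(a: str, b: str) -> float:
    """Weighted Levenshtein distance: cheapest sequence of edits turning a into b."""
    # prev[j] holds the cost of turning the first i-1 chars of a into the first j chars of b.
    prev = [j * INSERT_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * DELETE_COST]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else SUBSTITUTE_COST)
            delete = prev[j] + DELETE_COST
            insert = curr[j - 1] + INSERT_COST
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def nearest_weighted_distance(url: str, vulnerable_sites: Iterable[str]) -> float:
    """Feature value: smallest weighted distance from url to any listed site."""
    return min(weighted_distance(url, site) for site in vulnerable_sites)
```

With these assumed weights, a URL that differs from a listed site by one substituted character gets a small feature value, while an unrelated URL gets a large one. An empty site list raises `ValueError` from `min`, so a caller would need to handle that case.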
Because the initial cluster centers of FCM are uncertain and the algorithm easily falls into a local optimum, a coordinate density method is proposed to determine the initial cluster centers. Because the initial number of clusters in FCM is otherwise chosen at random, a method is proposed for determining the K value and the number of data sets. Clustering then yields the cluster centers of the samples and the cluster to which each center belongs, and a test sample is assigned the category of the cluster at the smallest distance from it. Model verification: the LR model, the J48 model, and the improved KNN model are used to classify the data in WEKA. The data mining algorithms are also run both on data that include the new feature and on data that use only the original features, and the results are compared; the new feature improves the classification results to some extent. A comparison with other methods in the literature likewise finds that the proposed feature gives better results.
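The classification step described above, where a test sample takes the category of the nearest cluster, can be sketched as follows. The sketch uses the standard FCM update rules, not the thesis's improved variant: the coordinate density method is not detailed in the abstract, so the initial centers are simply passed in by the caller. The majority-vote labeling of clusters and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence

Point = Sequence[float]


def euclidean(p: Point, q: Point) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fcm(points: List[Point], centers: List[List[float]],
        m: float = 2.0, iterations: int = 50) -> List[List[float]]:
    """Standard fuzzy C-means, starting from caller-supplied initial centers."""
    eps = 1e-12  # avoids division by zero when a point coincides with a center
    dim = len(points[0])
    for _ in range(iterations):
        # Membership of each point in each cluster: 1 / sum_j (d_k / d_j)^(2/(m-1)).
        memberships = []
        for p in points:
            dists = [euclidean(p, c) + eps for c in centers]
            memberships.append(
                [1.0 / sum((dk / dj) ** (2.0 / (m - 1.0)) for dj in dists)
                 for dk in dists])
        # Each center becomes the mean of all points weighted by membership^m.
        new_centers = []
        for k in range(len(centers)):
            weights = [row[k] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(dim)])
        centers = new_centers
    return centers


def nearest_center(sample: Point, centers: List[List[float]]) -> int:
    return min(range(len(centers)), key=lambda k: euclidean(sample, centers[k]))


def label_clusters(points: List[Point], labels: List[str],
                   centers: List[List[float]]) -> List[object]:
    """Assumed rule: a cluster takes the majority label of its nearest training points."""
    votes = [Counter() for _ in centers]
    for p, y in zip(points, labels):
        votes[nearest_center(p, centers)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


def classify(sample: Point, centers: List[List[float]], cluster_labels: List[object]):
    """Assign the sample the label of the cluster whose center is closest."""
    return cluster_labels[nearest_center(sample, centers)]
```

Comparing a test sample against a handful of cluster centers instead of every training sample is what makes this cheaper than plain KNN, which matches the efficiency motivation in the abstract.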
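[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092;TP311.13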