
Research on Malicious Website Detection Technology Based on Data Mining

Published: 2019-06-25 10:03
[Abstract]: As the Internet has grown, network security has drawn increasing attention. Frequent attacks by malicious websites cause heavy financial losses for users and seriously threaten the security of individuals and even nations, so building models that identify and detect malicious websites is of great importance. Many researchers in China and abroad have improved feature selection methods, concentrating mostly on deeper mining of host features and lexical features, yet accuracy and efficiency remain unsatisfactory. To address these problems, this thesis first proposes the concept of a list of vulnerable websites and, on that basis, a new feature extraction scheme based on weighted distance. On the data mining side, it improves the KNN model by means of an improved fuzzy C-means (FCM) clustering algorithm, which raises the efficiency of the model. The main work is as follows.

Data acquisition: Data for normal websites and malicious websites are crawled separately, then cleaned, standardized, and stored in a MySQL database.

Feature extraction: Unlike the common website whitelist and website blacklist, the thesis compiles the websites that are easily attacked and proposes building a list of vulnerable websites. Because a malicious website is usually a normal website that has been modified to some degree, each type of modification is assigned its own weight, which gives rise to the concept of weighted distance. For any input URL, the smallest weighted distance between it and the URLs in the vulnerable-website list is computed and used as a new feature.

Model improvement: Both the KNN algorithm and the FCM algorithm are improved. Because FCM's initial cluster centers are uncertain and the algorithm easily falls into a local optimum, a coordinate density method is proposed to determine the initial cluster centers. Because the initial number of clusters in FCM is otherwise chosen at random, a method is proposed that determines it from the K value and the size of the data set. These steps yield the cluster centers of the samples and the cluster to which each center belongs. The class of a test sample is then decided by finding the cluster whose center is closest to it.

Model verification: The LR model, the J48 model, and the improved KNN model are used to classify the data with WEKA. Data with the new feature added and data with only the original features are classified with the same data mining algorithms and compared on accuracy; the classification results improve to some extent. A comparison with methods from other publications also shows that the feature performs well.
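The weighted-distance feature described in the abstract can be sketched as a weighted edit distance. This is only an illustration, not the thesis's actual scheme: the abstract does not name the modification types or their weights, so the three operations (insert, delete, substitute), their costs, the function names, and the example domains below are all assumptions.

```python
from typing import Iterable

# Hypothetical weights: the abstract does not state the modification types
# or their values, so these three costs are illustrative only.
INSERT_COST = 1.0
DELETE_COST = 1.0
SUBSTITUTE_COST = 0.5


def weighted_distance(a: str, b: str) -> float:
    """Weighted edit distance: cheapest way to turn string a into string b."""
    # prev[j] holds the cost of turning the first i-1 chars of a into b[:j].
    prev = [j * INSERT_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * DELETE_COST]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else SUBSTITUTE_COST)
            delete = prev[j] + DELETE_COST
            insert = curr[j - 1] + INSERT_COST
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def nearest_weighted_distance(url: str, vulnerable: Iterable[str]) -> float:
    """Smallest weighted distance from url to any entry in the list."""
    return min(weighted_distance(url, v) for v in vulnerable)


# Made-up list entries and input, for illustration only.
vulnerable_list = ["examplebank.com", "exampleshop.com"]
feature = nearest_weighted_distance("examp1ebank.com", vulnerable_list)
```

Here the input differs from the first list entry by a single substituted character, so the feature value equals `SUBSTITUTE_COST`. In a real pipeline this value would be appended to the host and lexical features before classification.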
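[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.092;TP311.13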
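The cluster-based classification step described in the abstract (cluster the training data, then assign a test sample the class of the cluster whose center is nearest) can be sketched as follows. This is a rough sketch under stated assumptions: it uses standard fuzzy C-means rather than the thesis's improved variant, the initial centers are passed in by hand as a stand-in for the coordinate density method (whose details the abstract does not give), the cluster count is fixed at 2 instead of being derived from K and the data set size, and all names and data points are hypothetical.

```python
import math
from collections import defaultdict


def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fcm(points, centers, m=2.0, iters=50):
    """Standard fuzzy C-means with fuzzifier m; returns the final centers."""
    exp = 2.0 / (m - 1.0)
    dims = len(points[0])
    for _ in range(iters):
        # Membership of each point in each cluster (each row sums to 1).
        u = []
        for x in points:
            d = [max(dist(x, c), 1e-12) for c in centers]
            u.append([1.0 / sum((di / dk) ** exp for dk in d) for di in d])
        # New centers: means of the points weighted by membership ** m.
        centers = [
            tuple(
                sum(u[j][i] ** m * points[j][k] for j in range(len(points)))
                / sum(u[j][i] ** m for j in range(len(points)))
                for k in range(dims)
            )
            for i in range(len(centers))
        ]
    return centers


def nearest_center(x, centers):
    """Index of the cluster center closest to x."""
    return min(range(len(centers)), key=lambda i: dist(x, centers[i]))


def label_clusters(points, labels, centers):
    """Give each cluster the majority label of the training points nearest it."""
    votes = [defaultdict(int) for _ in centers]
    for x, y in zip(points, labels):
        votes[nearest_center(x, centers)][y] += 1
    return [max(v, key=v.get) if v else None for v in votes]


def classify(x, centers, cluster_labels):
    """Class of the cluster whose center is nearest to x."""
    return cluster_labels[nearest_center(x, centers)]


# Made-up two-feature training data, for illustration only.
train = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
train_labels = ["benign"] * 3 + ["malicious"] * 3
centers = fcm(train, [train[0], train[3]])  # hand-picked initial centers
cluster_labels = label_clusters(train, train_labels, centers)
prediction = classify((4.8, 5.2), centers, cluster_labels)
```

Comparing a test sample against a handful of cluster centers instead of every training sample is what makes this cheaper than plain KNN, which is consistent with the efficiency gain the abstract claims.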





