Spam Web Page Fraud Detection Based on Cost-Sensitive Methods

Published: 2018-05-30 23:20

  Topics: spam web page detection + cost-sensitive learning; Source: Southwest Jiaotong University, 2017 master's thesis


[Abstract]: With the rapid development of Internet technology over the past 20 years, websites and Web applications of every kind have emerged, bringing convenience to people's lives. At the same time, as a by-product of this growth, the Internet also hosts a large number of spam web pages containing fraudulent or harmful content, and these pages, spread by spammers, seriously harm the interests of Internet users. Accurately identifying and detecting such pages is an active research topic. This thesis first addresses binary spam page detection. It studies the different costs incurred when spam pages and normal pages are misclassified, and adopts a detection method based on a cost-sensitive support vector machine (SVM). Because many cost-sensitive schemes require the costs to be specified by hand, the thesis builds a spam page detection framework that integrates cost calculation on top of particle swarm optimization (PSO). Specifically, the cost-sensitive SVM is wrapped as the PSO fitness function: the cost parameters of the cost-sensitive classifier are the variables that PSO optimizes, and the classifier's AUC is the fitness value. This preserves detection performance while reducing the influence of human choices on the algorithm.

Second, the thesis studies multi-level spam page detection, which is finer-grained than binary detection and requires spam pages to be detected according to their degree of harm. Multi-level detection is implemented with a "one-vs-one" combination of cost-sensitive SVMs for multi-class classification, which both preserves detection performance and avoids the problem of merging costs in the cost matrix. PSO is again used to compute the multiple misclassification costs. Based on the original class labels of the UK2007 spam page data set, the thesis constructs MC-UK2007, a new three-class data set. UK2007 and MC-UK2007 are then used for binary and multi-class detection experiments with integrated cost calculation, and several groups of experiments with other algorithms are run for comparison. The results show that both proposed methods achieve better AUC values, indicating that they detect spam web pages more effectively.
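The binary framework described in the abstract (a cost-sensitive SVM wrapped as the PSO fitness function, with the misclassification costs as the search variables and AUC as the fitness value) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the synthetic two-feature data stands in for UK2007, a linear SVM trained by subgradient descent with a class-weighted hinge loss stands in for the thesis's cost-sensitive SVM, and the cost bounds, swarm size, and PSO coefficients are arbitrary. The function names (`make_data`, `train_cs_svm`, `auc`, `fitness`, `pso`) are hypothetical.

```python
import random

def make_data(n_spam, n_normal, seed):
    """Synthetic imbalanced data (assumption): label +1 = spam, -1 = normal."""
    rng = random.Random(seed)
    data = []
    for y, n in ((1, n_spam), (-1, n_normal)):
        for _ in range(n):
            data.append(([rng.gauss(y, 1.5), rng.gauss(y, 1.5)], y))
    return data

def train_cs_svm(train, cost_pos, cost_neg, epochs=20, lam=0.01, eta=0.01, seed=0):
    """Linear SVM by SGD; the hinge loss of each sample is scaled by its class cost."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    samples = list(train)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            cost = cost_pos if y == 1 else cost_neg
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            w = [wi * (1.0 - eta * lam) for wi in w]  # L2 regularization step
            if margin < 1.0:                          # margin violated: hinge subgradient
                w = [wi + eta * cost * y * xi for wi, xi in zip(w, x)]
                b += eta * cost * y
    return w, b

def auc(scores, labels):
    """AUC = P(random spam page scores above random normal page); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fitness(costs, train, valid):
    """PSO fitness: train with the candidate costs, return validation AUC."""
    w, b = train_cs_svm(train, cost_pos=costs[0], cost_neg=costs[1])
    scores = [w[0] * x[0] + w[1] * x[1] + b for x, _ in valid]
    return auc(scores, [y for _, y in valid])

def pso(fit, bounds, dim=2, n_particles=6, n_iters=5,
        inertia=0.7, c1=1.5, c2=1.5, seed=1):
    """Basic global-best PSO that maximizes `fit` inside a box constraint."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fit(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))  # clamp to bounds
            f = fit(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
        history.append(gbest_fit)
    return gbest, gbest_fit, history

train = make_data(30, 170, seed=0)   # imbalanced training set
valid = make_data(15, 85, seed=1)    # held-out set used for the AUC fitness
best_costs, best_auc, history = pso(lambda c: fitness(c, train, valid), bounds=(0.1, 10.0))
print("costs (spam, normal):", best_costs, "validation AUC:", best_auc)
```

On data this simple the AUC may change little with the costs; the point of the sketch is only the wiring, in which PSO proposes a cost pair, the classifier is retrained with it, and the resulting AUC is fed back as that particle's fitness.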
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.092; TP18


Article ID: 1957272

Link to this article: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/1957272.html
