Research and Implementation of Spam Web Page Detection Based on Decision Tree and Bayesian Algorithms

Published: 2019-02-09 14:40
[Abstract]: In the Internet era, search engines are under great pressure: tens of thousands of new web pages appear every day, and many site operators use illegitimate techniques to obtain high search rankings. Retrieving accurate information from the vast web while filtering out unhealthy, illegal, and useless content has become another focus of Internet research. Current work concentrates on filtering useless information and largely ignores the unhealthy and illegal pages mixed in with it. There is therefore a need for an algorithm that draws on the text-classification strengths of decision trees and Bayesian methods, one that both excludes pages that merely game the ranking and filters pages that spread unhealthy or illegal information.

With this in mind, the thesis first defines two kinds of spam web pages. The first kind uses cheating techniques to raise its ranking weight in search engine results, which lowers the accuracy of those results and seriously interferes with normal use of the search engine; these are called search engine spam pages. The second kind carries text that violates moral, legal, or cultural norms and may have a serious negative impact on society; these are called harmful-content spam pages. Detecting and filtering both kinds is an important task for search engines at present, whether viewed from the search engine's own interest or from that of society as a whole.

After analyzing the state of research on spam page detection, the thesis combines a decision tree algorithm (ID3) with a Bayesian algorithm to filter both kinds of spam pages. The two are combined because experiments showed that ID3 detects search engine spam pages with very high accuracy but has difficulty catching harmful-content spam pages whose features are indistinguishable from those of normal pages. A Bayesian algorithm compensates for this weakness, mainly because the naive Bayes classifier achieves high accuracy on content-based text classification. ID3 is a classification algorithm based on information gain and has a number of known shortcomings. The thesis proposes an improved ID3 algorithm tailored to the characteristics of spam pages; experimental results show that it both raises classification accuracy and reduces the dimensionality of the feature space (many unnecessary branches are pruned, so the algorithm runs more efficiently). The thesis also makes a number of detailed improvements to the basic strategy of the naive Bayes classifier for spam page detection and proposes ASN, an algorithm based on the χ² statistic (ASN abbreviates "attribute-selection naive Bayes classifier"). In the experiments it classifies very well, with the miss rate kept essentially within 8%.

To verify that combining the two algorithms is feasible, the thesis also implements a detection system. Its detection accuracy is (72±1.5)% for a single class of spam page and (75±0.85)% when detecting both classes simultaneously; the latter is a marked improvement over the filters currently in use.
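The abstract names two feature-scoring criteria without giving their formulas: the information gain that ID3 uses to choose splits, and the χ² statistic that ASN uses for attribute selection. The following is a minimal sketch of the textbook definitions of both, not the improved ID3 or the ASN algorithm from the thesis. The function names and the toy data (one hypothetical binary page feature per list, with labels "spam" and "normal") are assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting `labels` on `feature` values."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def chi_square(feature, labels):
    """Pearson chi-square statistic of the feature/label contingency table."""
    total = len(labels)
    observed = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    score = 0.0
    for f, nf in feature_counts.items():
        for c, nc in label_counts.items():
            expected = nf * nc / total  # count expected under independence
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score

# Hypothetical toy data: 4 spam pages, 4 normal pages.
labels = ["spam"] * 4 + ["normal"] * 4
informative = [1, 1, 1, 1, 0, 0, 0, 0]    # feature present only on spam pages
uninformative = [1, 0, 1, 0, 1, 0, 1, 0]  # feature unrelated to the label

for name, feature in [("informative", informative),
                      ("uninformative", uninformative)]:
    print(name, information_gain(feature, labels), chi_square(feature, labels))
```

On this toy data both criteria rank the informative feature above the uninformative one, which is the property a split criterion or an attribute-selection step relies on. How the thesis modifies these scores is not stated in the abstract.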
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP393.092

Article ID: 2419062


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2419062.html

