Research and Implementation of Spam Web Page Detection Based on Decision Tree and Bayesian Algorithms
[Abstract]: In the Internet era, search engines are under great pressure, not only because tens of thousands of new pages appear every day, but also because many website operators use illegitimate means to obtain high search rankings they do not deserve. How to obtain accurate information from the vast amount of content on the network while filtering out unhealthy, illegal, and useless information has become another hotspot of Internet research. Current research focuses on filtering useless information and does not take into account the many unhealthy and illegal pages. It is therefore necessary to find an intelligent algorithm, combining the strengths of decision trees and Bayesian algorithms in text classification, that can both eliminate pages whose high rankings are obtained by cheating and filter unhealthy and illegal information. Based on these considerations, this paper first defines two kinds of spam pages. The first kind uses cheating techniques to raise its ranking weight in search engine results, which reduces the accuracy of those results and seriously affects the normal use of search engines; such pages are called search engine spam pages. The second kind contains text that violates ethical, legal, or cultural norms and may have a serious negative impact on society; such pages are called bad information spam pages. Detecting and filtering both kinds of spam pages is an important task for search engines, both for their own sake and for society as a whole. Based on an analysis of the current state of spam detection algorithms, this paper combines the decision tree algorithm ID3 with a Bayesian algorithm to filter these two kinds of spam pages.
The two algorithms are combined because experimental analysis shows that, although ID3 detects search engine spam pages with very high accuracy, it has difficulty capturing bad information spam pages that share the characteristics of normal web pages. A Bayesian algorithm can make up for this deficiency, mainly because the naive Bayesian classifier has high classification accuracy on content-based text. ID3 is a classification algorithm based on information gain. This paper presents an improved ID3 algorithm tailored to the characteristics of spam web pages. The experimental results show that the improved algorithm not only improves classification accuracy but also effectively reduces the dimension of the feature space: many unnecessary branches are pruned, which makes the algorithm more efficient. This paper also makes a number of detailed improvements to the basic naive Bayesian classification strategy for spam detection and proposes an ASN algorithm based on the χ² statistic (ASN abbreviates naive Bayesian classifier with attribute selection). Analysis of the experimental results shows that classification performance is very good, with the miss rate (the share of spam pages that go undetected) kept essentially within 8%. To verify the feasibility of combining the two algorithms, a detection system is implemented in this paper. The detection accuracy of the system is (72 ± 1.5)% when detecting a single kind of spam page and (75 ± 0.85)% when detecting both kinds at the same time. The accuracy of simultaneous detection of both kinds of spam pages is significantly higher than that of the filters currently in use.
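The abstract names two selection criteria without stating them: information gain, which ID3 uses to choose splits, and the χ² statistic, which ASN uses to select attributes. The sketch below shows the standard textbook forms of both, not the improved variants proposed in the thesis. The function names, the feature name `hidden_text`, and the sample data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on `feature` (the ID3 criterion).

    rows   : list of dicts mapping feature name -> discrete value
    labels : class label for each row
    """
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: pages in the class that contain the term
    b: pages outside the class that contain the term
    c: pages in the class that lack the term
    d: pages outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Hypothetical toy data: one boolean feature that separates the classes.
    rows = [{"hidden_text": True}, {"hidden_text": True},
            {"hidden_text": False}, {"hidden_text": False}]
    labels = ["spam", "spam", "normal", "normal"]
    print(information_gain(rows, labels, "hidden_text"))
    print(chi_square(10, 0, 0, 10))
```

In both cases a higher score means the attribute tells the classes apart more sharply. A term that appears equally often in spam and normal pages scores zero on χ², so dropping such terms is one plausible way attribute selection can shrink the feature space before naive Bayes is applied.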
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.092