Research and Implementation of Spam Web Page Detection Based on Decision Tree and Bayesian Algorithms

Published: 2019-02-09 14:40

[Abstract]: In the Internet era, search engines are under heavy pressure, both because tens of thousands of new web pages appear every day and because many site operators use illegitimate means to obtain high search rankings. How to retrieve accurate information from the vast web while filtering out unhealthy, illegal, and useless information has become another active topic in Internet research. Current work concentrates on filtering useless information and does not account for the many unhealthy and illegal pages mixed in with it. An intelligent algorithm is therefore needed that combines the text-classification strengths of decision trees and Bayesian methods, so that it can both exclude pages that merely game the rankings and filter pages that spread unhealthy or illegal information.

With this in mind, this thesis first defines two kinds of spam web pages. The first kind uses cheating techniques to raise its ranking weight in search engine results, which lowers the accuracy of those results and seriously interferes with normal use of the search engine; these are called search engine spam pages. The second kind carries text that violates moral, legal, or cultural norms and may have a serious negative effect on society; these are called harmful-content spam pages. Whether from the search engine's own point of view or from that of society as a whole, detecting and filtering both kinds is an important task for search engines at present.

After analyzing the state of research on spam page detection algorithms, this thesis combines a decision tree algorithm (ID3) with a Bayesian algorithm to filter both kinds of spam pages. The two are combined because experiments showed that ID3, although very accurate at detecting search engine spam pages, has difficulty catching harmful-content spam pages whose features are indistinguishable from those of normal pages. A Bayesian algorithm compensates for this weakness, mainly because the naive Bayes classifier achieves high accuracy on content-based text classification. ID3 is a classification algorithm based on information gain and has a number of shortcomings of its own. This thesis proposes an improved ID3 algorithm tailored to the characteristics of spam pages; experimental results show that the improved algorithm not only raises classification accuracy but also effectively reduces the dimensionality of the feature space (it prunes many unnecessary branches, making the algorithm more efficient to run). The thesis also makes a number of detailed improvements to the basic strategy of the naive Bayes classifier for spam page detection and proposes ASN, an algorithm based on the χ² statistic (ASN stands for attribute-selection naive Bayes classifier). Experimental results show that it classifies well, with the miss rate generally kept within 8%. To verify that combining the two algorithms is feasible, the thesis also implements a detection system. Its detection accuracy is (72±1.5)% for a single class of spam page and (75±0.85)% when detecting both classes simultaneously; the latter is a marked improvement over filters currently in use.
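The abstract names two scoring criteria without spelling them out: information gain, which ID3 uses to pick the attribute to split on, and the χ² statistic, which the ASN algorithm uses for attribute selection. The following is a minimal sketch of the textbook forms of both on a toy data set. The function names, the labels, and the two example features are assumptions made for illustration; this is not the thesis's improved ID3 or its ASN algorithm.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction from splitting labels by feature value (ID3's criterion)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def chi_square(feature_present, labels, positive):
    """Chi-square statistic of the 2x2 table: feature present/absent vs. class."""
    n = len(labels)
    pairs = list(zip(feature_present, labels))
    a = sum(1 for f, y in pairs if f and y == positive)          # present, positive
    b = sum(1 for f, y in pairs if f and y != positive)          # present, other
    c = sum(1 for f, y in pairs if not f and y == positive)      # absent, positive
    d = n - a - b - c                                            # absent, other
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical toy data: four pages, two of them spam.
labels = ["spam", "spam", "normal", "normal"]
hidden_text = [True, True, False, False]   # separates the classes perfectly
has_title = [True, False, True, False]     # carries no class information

gain_hidden = information_gain(hidden_text, labels)   # 1.0 bit
gain_title = information_gain(has_title, labels)      # 0.0 bits
chi_hidden = chi_square(hidden_text, labels, "spam")  # 4.0
chi_title = chi_square(has_title, labels, "spam")     # 0.0
```

Both scores rank the informative feature above the uninformative one. In a pipeline like the one the abstract outlines, a score of this kind would be computed per attribute and the low-scoring attributes dropped before training the classifier.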
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092

[References]