基于遗传规划和集成学习的Web Spam检测关键技术研究

发布时间：2018-08-09 18:40

【摘要】：随着网络上的信息呈爆炸式增长,搜索引擎已成为日常生活中帮助人们发现其想要信息的重要工具。给定一个确定的查询,搜索引擎通常能返回成千上万个网页,但是大部分用户只读前几个,所以在搜索引擎中网页排名非常重要。因此,许多人采用一些手段来欺骗搜索引擎排序算法,使一些网页获得不应有的高排序值来吸引用户的关注,从而达到获取某方面利益的目的。所有试图增加网页在搜索引擎中排序的欺诈行为被称为Web Spam(网络作弊)。Web Spam严重降低了搜索引擎检索结果的质量,使用户在获取信息的过程中遇到巨大障碍,产生较差的用户体验。对于搜索引擎而言,即使这些作弊网页没有排得足够靠前来扰乱用户,抓取、索引和存储这些网页也需要成本。因此,识别Web Spam已成为搜索引擎的重要挑战之一。本文根据Web Spam数据集的特点,围绕基于网页特征构建分类器检测Web Spam方面进行了研究,主要工作包括以下三方面： (1)提出基于遗传规划学习判别函数检测Web Spam的方法将个体定义为检测Web Spam的判别函数,经过遗传操作,遗传规划就可以找到优化的判别函数来提高Web Spam的检测性能。然而,使用遗传规划产生判别函数时会出现一个问题,因为没有关于最优解的任何先验知识,所以很难知道个体的适当长度,如果个体长度太短,则个体中所包含的特征就会很少,个体的辨别力不高,对应函数表达式的分类性能就不好。要想充分利用Web Spam数据集中的内容、链接等特征,需要较长的判别函数,对应个体规模较大。对于由较大规模个体组成的种群,构造和搜索所需时间较长。基于较长判别函数是由若干较短判别函数组成的这一原理,本文提出通过遗传规划学习判别函数检测Web Spam,该方法先使用若干小规模的个体创建多个种群,每个种群经过遗传操作产生本种群的最好个体,然后再将每个种群所得的最好个体通过遗传规划进行组合得到更好的判别函数,从而利用较短时间就能产生性能更好的较长判别函数来检测Web Spam。本文还研究了表示个体的二叉树深度在遗传规划进化过程中的影响以及组合的效率。在WEBSPAM-UK2006数据集上进行了实验,实验结果表明,与单种群遗传规划相比,使用两次组合的多种群遗传规划能将召回率提高5.6%,F度量提高2.25%,正确率提高2.83%。与SVM相比,新方法将召回率提高了26%,F度量提高了11%,精确度提高了4%。 (2)提出利用基于遗传规划的集成学习检测Web Spam的方法。目前多数基于分类检测Web Spam的方法只使用一种分类算法构造一个分类器,并且大都忽略了数据集中作弊样本和正常样本的不平衡性,即正常样本比作弊样本多很多。由于存在多种不同类型的Web Spam技术,新类型的Spam技术也在不断出现,期望发现一个万能分类器来检测所有类型的WebSpam是不可能的。所以,通过集成多个分类器的检测结果来找到增强分类器用于检测Web Spam是一种有效方法,并且集成学习也是解决非平衡数据集分类问题的有效方法之一。在集成学习中,如何产生多样的基分类器和如何组合它们的分类结果是两个关键的问题。本文提出利用基于遗传规划的集成学习来检测Web Spam,首先使用不同的分类算法分别在不同的样本集和特征集上进行训练产生多样的基分类器,然后使用遗传规划学习得到一个新颖的分类器,由它基于多个基分类器的检测结果给出最终检测结果。该方法根据Web Spam数据集的特点,利用不同的数据集合和分类算法产生差异较大的基分类器,利用遗传规划对基分类器的结果进行集成,不仅易于集成不同类型分类器的结果,提高分类性能,还能选择部分基分类器用于集成,降低预测时间。该方法还可以将欠抽样技术和集成学习融合起来提高非平衡数据集的分类性能。为了验证遗传规划集成方法的有效性,分别在平衡数据集和非平衡数据集上进行了实验。在平衡数据集的实验部分,首先分析了分类算法和特征集合对集成的影响,然后将其与已知集成学习算法进行比较,结果显示在准确率、召回率、F-度量、精确度,错误率和AUC方面,优于一些已知的集成学习算法；在非平衡数据集上的实验表明无论是同态集成还是异态集成,遗传规划集成均能提高分类的性能,且异态集成比同态集成更加有效；遗传规划集成比AdaBoost、Bagging、RandomForest、多数投票集成、EDKC算法和基于Prediction Spamicity的方法取得更高的F-度量值。 (3)提出基于遗传规划产生新特征检测Web Spam的方法。特征在分类中扮演着很重要的角色,Web Spam数据集中有96个内容特征、41个链接特征和138个转换链接特征,其中138个转换链接特征是41个链接特征的简单组合或对数操作,这些特征的产生不仅需要由专家来完成,还很耗费人力,并且也不易把不同类型(如内容特征和链接特征)的特征融合在一起。该方法提出利用遗传规划将已有特征进行组合从而产生更有区别力的新特征,然后将这些新特征作为分类器的输入来检测Web Spam。在WEBSPAM-UK2006数据集上的实验显示,使用10个新特征的分类器的分类结果好于使用原41个链接特征的分类器,与使用138个转换链接特征的分类器的性能相当。
[Abstract]:With the explosive growth of information on the network, search engines have become an important tool in daily life to help people find information they want. A given query, a search engine usually can return thousands of pages, but most users read only a few before, so it is very important to rank in the search engine. Many people use some means to cheat the search engine sorting algorithm, so that some web pages get undue high sort values to attract users' attention, so as to achieve the purpose of gaining a certain interest. All the frauds trying to increase the sort of the web page in the search engine are called the Web Spam (Network cheating).Web Spam, which has severely reduced the search. The quality of the engine retrieval results, the user has a huge obstacle in the process of obtaining information and produces a poor user experience. For the search engine, even if these cheating pages are not enough to come to disrupt the user, it is necessary to capture, index, and store these web pages. Therefore, the identification of Web Spam has become a heavy search engine. One of the challenges.
According to the characteristics of Web Spam dataset, this paper studies the construction of classifier based on Web page features to detect Web Spam. The main work includes the following three aspects:
(1) A method for Web Spam detection based on learning discriminant function of genetic programming is proposed.
The individual is defined as a discriminant function for detecting Web Spam. Through genetic manipulation, genetic programming can find an optimized discriminant function to improve the detection performance of Web Spam. However, there will be a problem when using genetic programming to produce a discriminant function, because there is no prior knowledge of the optimal solution, so it is difficult to know the appropriate individual. Length, if the length of the individual is too short, the characteristics contained in the individual will be very few, the individual's discrimination is not high, the classification performance of the corresponding function expression is not good. To make full use of the contents of the Web Spam data set, link and other characteristics, it needs a longer discriminant function, the size of the individual is larger. For the larger individual, it is made up of a large scale individual. The time required for a population, construction and search is longer. Based on the principle that a longer discriminant function is composed of several shorter discriminant functions, this paper proposes a genetic programming learning discriminant function to detect Web Spam. This method first creates a number of populations with a number of small individuals, each of which produces the best of the population through genetic manipulation. A better discriminant function is obtained by combining the best individuals of each population through genetic programming, so that a longer discriminant function with better performance can be generated by using a shorter time to detect Web Spam.. The influence of the two forked tree depth on the evolutionary process of genetic programming and the effect of the combination are also studied. Rate.
Experiments on the WEBSPAM-UK2006 data set show that compared with the single population genetic programming, the recall rate can be increased by 5.6%, the F measure is improved by 2.25%. The recall rate of the new method is increased by 26%, the F measure is increased by 11%, and the accuracy is improved by 4%., compared with the two combination genetic programming.
(2) An ensemble learning method based on genetic programming is proposed to detect Web Spam.
At present, most of the methods based on the classification detection Web Spam only use one kind of classification algorithm to construct a classifier, and most of them ignore the imbalance between the sample and the cheating sample in the data set, that is, the normal sample is much more than the cheating sample. Because there are many different types of Web Spam technology, the new type of Spam technology is also coming out At present, it is impossible to find a universal classifier to detect all types of WebSpam. Therefore, it is an effective method to find an enhanced classifier to detect Web Spam by integrating the detection results of multiple classifiers. And integrated learning is also one of the effective methods to solve the problem of non balanced dataset classification. How to generate a variety of base classifiers and how to combine their classification results are two key problems. This paper proposes to use genetic programming based integrated learning to detect Web Spam. Firstly, different classification algorithms are used to train different base classifiers on different sample sets and feature sets, and then use heredity. A novel classifier is obtained by programming learning, which gives the final test results based on the test results of multiple base classifiers.
According to the characteristics of the Web Spam data set, the method produces different base classifiers with different data sets and classification algorithms, and integrates the results of the base classifier by genetic programming. It is not only easy to integrate the results of different types of classifiers, improve the classification performance, but also select some base classifiers for integration and reduce the prediction. This method can also integrate undersampling and integrated learning to improve the classification performance of non balanced data sets. In order to verify the effectiveness of genetic programming integration methods, experiments are carried out on balanced data sets and non balanced datasets respectively. In the experiment part of the balanced dataset, the classification algorithm and feature set pair are analyzed. The effects of integration are compared with the known integrated learning algorithms, and the results show that the accuracy, recall, F-, accuracy, error rate and AUC are superior to some known integrated learning algorithms; experiments on nonbalanced datasets show that genetic programming integration can improve the score of genetic programming. Class performance, and heteromorphic integration is more effective than homomorphic integration; genetic programming integration is better than AdaBoost, Bagging, RandomForest, majority voting integration, EDKC algorithm and Prediction Spamicity based methods to achieve higher F- metrics.
(3) A new feature detection method based on genetic programming for Web Spam is proposed.
Features play a very important role in the classification. The Web Spam data set has 96 content features, 41 link features and 138 conversion link features, of which 138 conversion link features are simple combinations or logarithmic operations of 41 link features. These features not only need to be completed by experts, but also very expensive and not easy to do. Combining the characteristics of different types (such as content features and link features), this method proposes to use genetic programming to combine the existing features to produce new features with more distinct forces, and then use these new features as input to detect the experimental display of Web Spam. on the WEBSPAM-UK2006 dataset and use 10 new ones. The classification result of the feature classifier is better than that of the original 41 link features, and the performance of the classifier is comparable to that of the 138 transform link features.
【学位授予单位】：山东大学
【学位级别】：博士
【学位授予年份】：2012
【分类号】：TP18;TP391.3

【参考文献】