基于遗传规划和集成学习的Web Spam检测关键技术研究
[Abstract]:With the explosive growth of information on the network, search engines have become an important tool in daily life to help people find information they want. A given query, a search engine usually can return thousands of pages, but most users read only a few before, so it is very important to rank in the search engine. Many people use some means to cheat the search engine sorting algorithm, so that some web pages get undue high sort values to attract users' attention, so as to achieve the purpose of gaining a certain interest. All the frauds trying to increase the sort of the web page in the search engine are called the Web Spam (Network cheating).Web Spam, which has severely reduced the search. The quality of the engine retrieval results, the user has a huge obstacle in the process of obtaining information and produces a poor user experience. For the search engine, even if these cheating pages are not enough to come to disrupt the user, it is necessary to capture, index, and store these web pages. Therefore, the identification of Web Spam has become a heavy search engine. One of the challenges.
According to the characteristics of Web Spam dataset, this paper studies the construction of classifier based on Web page features to detect Web Spam. The main work includes the following three aspects:
(1) A method for Web Spam detection based on learning discriminant function of genetic programming is proposed.
The individual is defined as a discriminant function for detecting Web Spam. Through genetic manipulation, genetic programming can find an optimized discriminant function to improve the detection performance of Web Spam. However, there will be a problem when using genetic programming to produce a discriminant function, because there is no prior knowledge of the optimal solution, so it is difficult to know the appropriate individual. Length, if the length of the individual is too short, the characteristics contained in the individual will be very few, the individual's discrimination is not high, the classification performance of the corresponding function expression is not good. To make full use of the contents of the Web Spam data set, link and other characteristics, it needs a longer discriminant function, the size of the individual is larger. For the larger individual, it is made up of a large scale individual. The time required for a population, construction and search is longer. Based on the principle that a longer discriminant function is composed of several shorter discriminant functions, this paper proposes a genetic programming learning discriminant function to detect Web Spam. This method first creates a number of populations with a number of small individuals, each of which produces the best of the population through genetic manipulation. A better discriminant function is obtained by combining the best individuals of each population through genetic programming, so that a longer discriminant function with better performance can be generated by using a shorter time to detect Web Spam.. The influence of the two forked tree depth on the evolutionary process of genetic programming and the effect of the combination are also studied. Rate.
Experiments on the WEBSPAM-UK2006 data set show that compared with the single population genetic programming, the recall rate can be increased by 5.6%, the F measure is improved by 2.25%. The recall rate of the new method is increased by 26%, the F measure is increased by 11%, and the accuracy is improved by 4%., compared with the two combination genetic programming.
(2) An ensemble learning method based on genetic programming is proposed to detect Web Spam.
At present, most of the methods based on the classification detection Web Spam only use one kind of classification algorithm to construct a classifier, and most of them ignore the imbalance between the sample and the cheating sample in the data set, that is, the normal sample is much more than the cheating sample. Because there are many different types of Web Spam technology, the new type of Spam technology is also coming out At present, it is impossible to find a universal classifier to detect all types of WebSpam. Therefore, it is an effective method to find an enhanced classifier to detect Web Spam by integrating the detection results of multiple classifiers. And integrated learning is also one of the effective methods to solve the problem of non balanced dataset classification. How to generate a variety of base classifiers and how to combine their classification results are two key problems. This paper proposes to use genetic programming based integrated learning to detect Web Spam. Firstly, different classification algorithms are used to train different base classifiers on different sample sets and feature sets, and then use heredity. A novel classifier is obtained by programming learning, which gives the final test results based on the test results of multiple base classifiers.
According to the characteristics of the Web Spam data set, the method produces different base classifiers with different data sets and classification algorithms, and integrates the results of the base classifier by genetic programming. It is not only easy to integrate the results of different types of classifiers, improve the classification performance, but also select some base classifiers for integration and reduce the prediction. This method can also integrate undersampling and integrated learning to improve the classification performance of non balanced data sets. In order to verify the effectiveness of genetic programming integration methods, experiments are carried out on balanced data sets and non balanced datasets respectively. In the experiment part of the balanced dataset, the classification algorithm and feature set pair are analyzed. The effects of integration are compared with the known integrated learning algorithms, and the results show that the accuracy, recall, F-, accuracy, error rate and AUC are superior to some known integrated learning algorithms; experiments on nonbalanced datasets show that genetic programming integration can improve the score of genetic programming. Class performance, and heteromorphic integration is more effective than homomorphic integration; genetic programming integration is better than AdaBoost, Bagging, RandomForest, majority voting integration, EDKC algorithm and Prediction Spamicity based methods to achieve higher F- metrics.
(3) A new feature detection method based on genetic programming for Web Spam is proposed.
Features play a very important role in the classification. The Web Spam data set has 96 content features, 41 link features and 138 conversion link features, of which 138 conversion link features are simple combinations or logarithmic operations of 41 link features. These features not only need to be completed by experts, but also very expensive and not easy to do. Combining the characteristics of different types (such as content features and link features), this method proposes to use genetic programming to combine the existing features to produce new features with more distinct forces, and then use these new features as input to detect the experimental display of Web Spam. on the WEBSPAM-UK2006 dataset and use 10 new ones. The classification result of the feature classifier is better than that of the original 41 link features, and the performance of the classifier is comparable to that of the 138 transform link features.
【学位授予单位】:山东大学
【学位级别】:博士
【学位授予年份】:2012
【分类号】:TP18;TP391.3
【参考文献】
相关期刊论文 前9条
1 赵强利;蒋艳凰;徐明;;选择性集成算法分类与比较[J];计算机工程与科学;2012年02期
2 张春霞;张讲社;;选择性集成学习算法综述[J];计算机学报;2011年08期
3 武磊;高斌;李京;;基于结构信息和时域信息的垃圾网页检测技术[J];计算机应用研究;2008年04期
4 余慧佳;刘奕群;张敏;茹立云;马少平;;基于大规模日志分析的搜索引擎用户行为分析[J];中文信息学报;2007年01期
5 余慧佳;刘奕群;张敏;马少平;茹立云;;基于目的分析的作弊页面分类[J];中文信息学报;2009年02期
6 杨明;尹军梅;吉根林;;不平衡数据分类方法综述[J];南京师范大学学报(工程技术版);2008年04期
7 贺志明;王丽宏;张刚;程学旗;;一种抵抗链接作弊的PageRank改进算法[J];中文信息学报;2012年05期
8 丁岳伟;王虎林;;降级Web Spam的可信度链接分析算法[J];计算机工程与设计;2009年10期
9 曾刚;李宏;;一个基于现实世界的大型Web参照数据集——UK2006 Datasets的初步研究[J];企业技术开发;2009年05期
相关会议论文 前1条
1 李智超;余慧佳;马少平;;使用支持向量机进行作弊页面识别[A];第三届全国信息检索与内容安全学术会议论文集[C];2007年
相关博士学位论文 前4条
1 李军;不平衡数据学习的研究[D];吉林大学;2011年
2 赵强利;基于选择性集成的在线机器学习关键技术研究[D];国防科学技术大学;2010年
3 陈海霞;面向数据挖掘的分类器集成研究[D];吉林大学;2006年
4 谢元澄;分类器集成研究[D];南京理工大学;2009年
相关硕士学位论文 前3条
1 冯东庆;基于链接分析的网页排序作弊检测方法研究[D];吉林大学;2011年
2 孙丽娜;集成异种分类器分类稀有类[D];郑州大学;2007年
3 韩博;反搜索引擎作弊中种子集合自动扩展算法研究[D];大连理工大学;2009年
,本文编号:2174959
本文链接:https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2174959.html