Research on the Application of Multi-View Learning to Web Spam Detection

Published: 2018-04-27 12:03

Topic: multi-view learning + web spam detection; Source: Shandong Normal University, master's thesis, 2014

【Abstract】: The Internet has greatly changed the way people express themselves and interact with others, and it has become the primary means of information retrieval. As a result, adding information to HTML pages and other web documents keeps getting easier, while users find it harder to tell accurate information from inaccurate and trustworthy sources from unreliable ones. Building an effective web spam detection method is therefore a major current challenge. Work on web spam detection today focuses on pages that rely on content spamming and link spamming. Existing methods usually classify a page as spam or not using features from a single view of the page, whereas multi-view learning uses both kinds of features at once and so treats the detection problem more completely. This thesis centers on multi-view learning for web spam detection and studies feature extraction methods, classification methods, and the link structure of web pages. The main results are as follows:

(1) The content-based and link-based features of a web spam dataset are treated as two different views of the detection problem. Canonical correlation analysis (CCA) and several improved variants are first applied to extract features: the transformation matrices yield, for each view, features along the projection directions with maximal correlation between the two views. The two views' features are then merged into a single feature using different combination schemes, and the resulting single-view feature is used to train a classifier. Experiments show that treating web spam detection as a multi-view classification problem, that is, viewing the data as two views and applying CCA, improves classification accuracy.

(2) Because only a small number of pages are labeled in web spam detection, semi-supervised co-training can be used. Page features are split into a content view and a link view. Before classification, independent component analysis (ICA) extracts the independent components of each view's features; the classification itself is carried out by co-training. Experiments show that this combination of feature extraction and semi-supervised classification improves detection accuracy, and that applying ICA to the two views separately is also more effective.

(3) The SVM classifier is modified using the link structure of web pages. A direct-link matrix and an indirect-link matrix are first used to build a within-class scatter matrix that preserves the link structure; the link structure is then incorporated into the SVM classifier, which gives a reformulated optimization problem. The method is well suited to exploiting link information. Experiments on web spam datasets show that combining link structure with the SVM classifier significantly outperforms other related methods; the experiments also show how classification accuracy changes with the step length of the indirect links.

(4) The detection problem is addressed by carefully accounting for the different structure and statistical properties of the content and link views. Principal component analysis (PCA) and locality preserving projections (LPP) are reformulated for the content and link features respectively, then combined in the proposed method, which extracts a consistent pattern from the multi-view embeddings of the multi-view representation. An iterative algorithm yields a distinct embedding for each view together with the transformation matrix from each view to the consistent pattern. A method for computing the consistent pattern of a test sample is also provided. Experiments on the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets show that detection based on the consistent pattern outperforms other related dimensionality reduction methods.
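
The pipeline in result (1) can be illustrated with a minimal sketch, given here only as one possible reading of the abstract and not as the thesis's implementation. It assumes plain regularized CCA, two fusion schemes (concatenation and summation) chosen for illustration, a nearest-centroid classifier as a stand-in for whatever classifier the thesis used, and synthetic data in place of the WEBSPAM-UK datasets. All function names are hypothetical.

```python
import numpy as np


def make_toy_views(n=400, seed=0):
    """Synthetic stand-in for content and link features (not WEBSPAM-UK data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)  # hypothetical labels: 1 = spam, 0 = normal
    latent = (2.0 * y - 1.0)[:, None] + 0.5 * rng.normal(size=(n, 1))
    content = latent @ np.linspace(0.5, 1.5, 6)[None, :] + rng.normal(size=(n, 6))
    link = latent @ np.linspace(-1.0, 1.0, 4)[None, :] + rng.normal(size=(n, 4))
    return content, link, y


def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def fit_cca(X, Y, n_components, reg=1e-3):
    """Regularized CCA: whiten each view, then take the SVD of the cross-covariance."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :n_components]     # transformation matrix for the first view
    Wy = Ky @ Vt.T[:, :n_components]  # transformation matrix for the second view
    return (mx, Wx), (my, Wy), s[:n_components]  # s holds canonical correlations


def fuse(X, Y, px, py, mode="concat"):
    """Project both views and merge them into one single-view feature."""
    (mx, Wx), (my, Wy) = px, py
    Zx, Zy = (X - mx) @ Wx, (Y - my) @ Wy
    return np.hstack([Zx, Zy]) if mode == "concat" else Zx + Zy


def fit_centroids(Z, y):
    """Nearest-centroid classifier: one mean vector per class (0 and 1)."""
    return np.stack([Z[y == c].mean(axis=0) for c in (0, 1)])


def predict(centroids, Z):
    """Assign each row of Z to the class with the closest centroid."""
    d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


content, link, y = make_toy_views()
train, test = slice(0, 300), slice(300, 400)
px, py, corrs = fit_cca(content[train], link[train], n_components=2)
Z_train = fuse(content[train], link[train], px, py, mode="concat")
Z_test = fuse(content[test], link[test], px, py, mode="concat")
centroids = fit_centroids(Z_train, y[train])
accuracy = float((predict(centroids, Z_test) == y[test]).mean())
print("canonical correlations:", corrs)
print("held-out accuracy:", accuracy)
```

Passing `mode="sum"` to `fuse` adds the two projections instead of concatenating them, which halves the fused dimensionality. Which schemes the thesis actually compared is not stated in the abstract.
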
【Degree-granting institution】: Shandong Normal University
【Degree level】: Master's
【Year conferred】: 2014
【Classification number】: TP393.092

【References】