Research on Web Text Mining Methods for Information Retrieval

Published: 2018-03-31 01:28

Topic: Web text mining　Focus: semi-supervised learning　Source: South China University of Technology, 2012 doctoral dissertation

[Abstract]: Today the Internet has become a popular, interactive medium for publishing information. As a huge, open, heterogeneous, and dynamic information container, the Web generates and holds text, data, multimedia, transient data, and other kinds of information on a massive scale. Because these resources are scattered and lack unified management and structure, obtaining relevant information is not easy, and the content people actually care about is often buried among large amounts of irrelevant information.
Through research on Web data mining, new Web text mining methods and techniques can be applied to information retrieval. Using the results of Web text mining to improve the accuracy and efficiency of page content classification and clustering, to improve the organization of retrieval results, and to make finding and using Web information more efficient can directly or indirectly address the shortcomings of search engines: low precision, low recall, information overload, limited ways of organizing returned results, and a single form of service. This provides technical support for advancing information retrieval systems to a new level. Research on Web text mining methods for information retrieval therefore has significant theoretical importance and commercial value.
At present, Web text content mining is a very active research direction from the information retrieval perspective. Many researchers have studied it broadly and in depth, and although some encouraging results and applications have been achieved, the field is far from mature and still faces many important open problems: no "best" dimensionality reduction method for feature selection has yet been found; text data are high-dimensional and sparse, so the accuracy and efficiency of traditional classification and clustering algorithms are hard to improve; semi-supervised learning from small training samples remains a problem; and massive data are hard to search, which raises the question of how to organize and present retrieval results effectively so that they are easy to query and browse.
Building on existing Web text content mining methods and research results, this dissertation studies the key problems and methods of Web text mining. It gives separate solutions to feature dimensionality reduction for class-imbalanced data and for sentiment-bearing data such as online reviews; taking semi-supervised learning as its main research subject, it proposes several new semi-supervised learning algorithms and applies them to Web text mining; and it proposes a method for clustering retrieval results to improve the organization of search results. Comparative experiments on several commonly used benchmark datasets verify the effectiveness of the improved methods.
The main results and contributions of this dissertation are as follows:
1. For classification on imbalanced text collections, an enhanced Expectation Maximization (EM) semi-supervised classification algorithm based on Naive Bayes is proposed. First, an effective feature selection function is constructed to filter out large numbers of uninformative feature words while retaining features that carry high category information, so that the dimensionality of the feature space of a class-imbalanced dataset is genuinely reduced. In addition, the combination of EM with the Naive Bayes classifier is modified: in each iteration, the unlabeled documents with the highest posterior class probability are moved from the unlabeled training set to the labeled set, which keeps them from interfering with the class assignment of the remaining unlabeled samples.
2. For classifying Web text with a clear sentiment orientation, such as online product reviews, a semi-supervised classification algorithm based on feature distribution is proposed. The class distribution of each feature is used to compensate for the shortcomings of the information gain method: the joint probability of feature and class in the original information gain is corrected so as to amplify the differences in how often a feature occurs across classes. The adjusted information gain retains the features that truly have high class-discriminating power and thus effectively reduces the dimensionality of the feature space. The feature-distribution-based selection method is then combined with the enhanced EM algorithm for semi-supervised text classification, achieving good classification results and performance.
3. To address the unsatisfactory accuracy and efficiency of traditional Web text clustering methods, a semi-supervised clustering algorithm based on affinity propagation with strong class features is proposed. It builds on the efficient and fast affinity propagation algorithm and incorporates the idea of semi-supervised clustering, making full use of the prior information latent in a small amount of labeled data. Features with strong class-discriminating power are extracted and used to adjust the cosine similarity matrix of the training samples, yielding a similarity measure that combines strong class features with cosine similarity. After each round of iteration, the unlabeled samples whose class is most certain are moved to the labeled set. These measures substantially improve both the performance and the accuracy of the algorithm.
4. To make better use of a small number of labeled samples, a semi-supervised clustering algorithm that combines seed spreading with affinity propagation is proposed. In the initial stage of clustering, the few labeled samples serve as initial seeds; the seed set is then enlarged by spreading, and further cleaned and refined by removing mislabeled and noisy data, so that the initial seeds gradually grow into a larger, high-quality seed set that improves cluster initialization. The class structure information contained in the seed set is also used to build a more reasonable similarity measure, which drives the algorithm to converge quickly toward the correct clustering. This offers an effective solution for analyzing large-scale, asymmetric, high-dimensional sparse Web text.
5. To improve the organization and presentation of Web search results and make information easier to find and browse, a Web search result clustering algorithm based on latent semantic information and suffix trees is proposed. The algorithm first combines the strengths of the vector space model and the suffix tree model to cluster Web page snippets, grouping pages that share many phrases into a base cluster. It then uses latent semantic indexing to extract the latent semantic associations between terms and documents and selects candidate phrases that fit the topic as the directory label of each base cluster. The resulting clusters make Web search results easy to browse and help users quickly find the pages or sites they are interested in.
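
The transfer step described in contribution 1 can be illustrated with a small sketch. It is not the dissertation's algorithm: it assumes a plain multinomial Naive Bayes model with add-one smoothing, uses hard label assignment instead of full EM with fractional counts, and omits the feature selection function. The names `train_nb`, `posterior`, `self_train`, and the `per_round` parameter are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels, vocab, classes):
    """Fit multinomial Naive Bayes with add-one smoothing (an assumed baseline)."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(t for t in tokens if t in vocab)
    log_prior = {
        c: math.log((class_counts[c] + 1) / (len(labels) + len(classes)))
        for c in classes
    }
    log_like = {}
    for c in classes:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like


def posterior(tokens, model, classes):
    """Return P(class | document), normalised in log space for stability."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in classes
    }
    top = max(scores.values())
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


def self_train(labeled, unlabeled, per_round=1):
    """Each round, move the most confidently classified unlabeled documents
    into the labeled set, then retrain on the enlarged labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    classes = sorted({y for _, y in labeled})
    vocab = {t for d, _ in labeled for t in d} | {t for d in pool for t in d}
    while pool:
        model = train_nb([d for d, _ in labeled], [y for _, y in labeled],
                         vocab, classes)
        scored = []
        for i, tokens in enumerate(pool):
            post = posterior(tokens, model, classes)
            best = max(classes, key=lambda c: post[c])
            scored.append((post[best], i, best))
        scored.sort(reverse=True)           # highest posterior first
        chosen = scored[:per_round]
        for _, i, label in chosen:
            labeled.append((pool[i], label))
        taken = {i for _, i, _ in chosen}
        pool = [d for i, d in enumerate(pool) if i not in taken]
    model = train_nb([d for d, _ in labeled], [y for _, y in labeled],
                     vocab, classes)
    return model, labeled
```

Promoting only the most confident documents in each round means later rounds are trained on labels the model was already fairly sure of, which is the intuition behind keeping uncertain documents from influencing the others.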
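
For context on contribution 2, the sketch below computes the standard, unmodified information gain of a term with respect to the class labels and keeps the top-scoring terms. The abstract does not give the corrected formula that amplifies between-class differences, so that correction is not reproduced here. The names `entropy`, `information_gain`, and `select_features` are hypothetical, and documents are assumed to be collections of tokens.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)]."""
    classes = sorted(set(labels))
    with_term = Counter(y for d, y in zip(docs, labels) if term in d)
    without_term = Counter(y for d, y in zip(docs, labels) if term not in d)
    n = len(labels)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy([labels.count(c) for c in classes])
    h_conditional = (
        (n_with / n) * entropy([with_term[c] for c in classes])
        + (n_without / n) * entropy([without_term[c] for c in classes])
    )
    return h_class - h_conditional


def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]
```

A term that appears equally often in every class scores zero and is dropped first, while a term confined to one class scores highest; the correction described in contribution 2 is said to sharpen exactly this contrast.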

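[Degree-granting institution]: South China University of Technology
[Degree level]: Doctorate
[Year degree granted]: 2012
[Classification number]: TP311.13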

[References]