Research on Text Spectral Clustering Combined with Probabilistic Latent Semantic Analysis

Published: 2018-06-14 22:33

Topic: cluster analysis + spectral clustering; source: Chongqing University, master's thesis, 2012

[Abstract]: Cluster analysis is an active research topic in data mining and is widely used in search engines, scientific data exploration, information filtering, Web analysis, image processing, and other fields. Spectral clustering is a comparatively new clustering algorithm. Compared with traditional clustering methods, it can handle complex data types by converting the clustering problem into an algebraic one, and it is simple to implement. It can also cluster sample spaces that contain clusters of arbitrary shape, can separate non-convex groupings, and can obtain a globally optimal solution.

Spectral clustering nevertheless has shortcomings. Its similarity matrix is usually built on the vector space model, which ignores synonymy and polysemy and therefore carries a large amount of redundant information. In addition, spectral clustering is very sensitive to the scale parameter of the Gaussian function, which makes its performance unstable. To address these problems, this thesis first uses probabilistic latent semantic analysis (PLSA) to extract latent semantic information, compensating for the lack of semantic description in the vector space model. It then constructs the similarity matrix from cosine similarity, which removes the influence of the scale parameter on spectral clustering. Finally, the improved method is applied to spectral clustering of text. The main work of the thesis is as follows:

① The shortcomings of the current vector space model are analyzed. First, the model ignores polysemy and synonymy among words, which causes feature redundancy. Second, because text features are high-dimensional, processing text data consumes a great deal of time. To address these problems, a spectral clustering algorithm combined with probabilistic latent semantic analysis is proposed.

② The theoretical background and methods of spectral clustering are studied, the general procedure of spectral clustering algorithms is summarized, and the construction of the similarity matrix in spectral clustering is analyzed in depth.

③ Traditional spectral clustering computes similarity with a Gaussian function, whose scale parameter has to be initialized by hand from experience. This limits the function and degrades clustering performance. Rather than optimizing the scale parameter, this thesis computes the similarity between texts with cosine similarity, which avoids the problems caused by choosing the scale parameter manually and improves the performance of spectral clustering. Finally, text spectral clustering is carried out on the reconstructed similarity matrix, and the experimental results are evaluated with clustering accuracy and mutual information. On these measures, computing text similarity with cosine similarity in the semantic space gives better and more stable spectral clustering results than the original method. The results show that the proposed improvement is feasible.
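The contrast between the two ways of building the similarity matrix can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names and the three document vectors are hypothetical, and the vectors merely stand in for the kind of document-topic weights a PLSA model might produce. The sketch shows that the Gaussian similarity changes sharply with the scale parameter `sigma`, while cosine similarity has no such parameter to tune.

```python
import math


def gaussian_similarity(x, y, sigma):
    """Gaussian similarity exp(-||x - y||^2 / (2 * sigma^2)); the result depends on sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; no scale parameter is involved."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)


def similarity_matrix(docs, sim):
    """Build the full pairwise similarity matrix for a list of document vectors."""
    n = len(docs)
    return [[sim(docs[i], docs[j]) for j in range(n)] for i in range(n)]


# Hypothetical document-topic weights over two latent topics (assumed values).
docs = [
    [0.9, 0.1],  # mostly topic 0
    [0.8, 0.2],  # mostly topic 0
    [0.1, 0.9],  # mostly topic 1
]

W_cos = similarity_matrix(docs, cosine_similarity)
# The same data under two different, hand-picked scale parameters.
W_small = similarity_matrix(docs, lambda x, y: gaussian_similarity(x, y, 0.05))
W_large = similarity_matrix(docs, lambda x, y: gaussian_similarity(x, y, 5.0))

for name, W in (("cosine", W_cos), ("gaussian sigma=0.05", W_small), ("gaussian sigma=5.0", W_large)):
    print(name)
    for row in W:
        print("  " + " ".join(f"{v:.4f}" for v in row))
```

With the small `sigma`, even the two documents that share a dominant topic look almost unrelated. With the large `sigma`, all three documents look nearly identical. The cosine matrix separates the two groups without any such choice. In a full pipeline, a matrix like `W_cos` would then be passed to the spectral clustering step.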
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1
