Research on Swarm Intelligence Text Clustering Methods Based on Semantic Similarity

Published: 2018-04-13 09:19

Topics: text clustering + semantic similarity; Source: Jiangsu University of Science and Technology, 2012 master's thesis

[Abstract]: The world is in an era of information explosion. Users searching for information are often overwhelmed by it and lose their way, which greatly reduces retrieval efficiency. How to classify and manage information quickly and efficiently, and how to provide users with accurate and useful information, is a problem that urgently needs solving. Against this background, text mining is receiving more and more attention. Text clustering is an important part of text mining and an important application of clustering methods in text processing. Because it needs no category labels and groups texts automatically, text clustering has been widely applied, for example in multi-document automatic summarization systems, search engines, and digital libraries. Most current clustering algorithms are based on the vector space model, so text clustering faces the problems of high dimensionality, high sparsity, and neglect of semantic information, all of which hurt the performance and accuracy of the algorithms. This thesis first introduces concepts and methods in text clustering, including inter-text distance calculation, text representation models, text preprocessing, evaluation of clustering results, and common clustering algorithms. It then introduces the organizational structure of HowNet (《知网》), its related concepts, and methods for computing semantic similarity, and proposes an improved method for computing the similarity between texts; combined with the K-means algorithm, the method is shown by experimental data to be correct. Finally, it introduces the two swarm intelligence algorithms used in the thesis and proposes a swarm intelligence text clustering algorithm based on semantic similarity. When weights are computed in the feature extraction stage of text preprocessing, the method considers term frequency and document frequency and also takes into account a word's part of speech and its position in the text. To address the vector space model's neglect of word semantics, the thesis uses HowNet to compute text similarity from the semantic information of words. Building on earlier research, the proposed algorithm starts from text similarity and integrates the K-means algorithm, the ant colony algorithm, and the simulated annealing algorithm, exploiting the strengths of each while avoiding their individual weaknesses; the experimental data indicate that the algorithm is effective.
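
The abstract describes the term weighting and the semantic text similarity only in outline, so the following Python sketch is one possible reading and not the thesis's actual formulas. Everything specific in it is an assumption made for illustration: the multipliers `POS_FACTOR` and `TITLE_BOOST`, the smoothed inverse document frequency, the weighted best-match similarity, and the lookup table `TOY_SIM`, which stands in for a HowNet-based word similarity.

```python
import math
from collections import Counter

# Hypothetical multipliers: the abstract names part of speech and position
# as weighting factors but does not give their values.
POS_FACTOR = {"noun": 1.0, "verb": 0.8, "other": 0.5}
TITLE_BOOST = 2.0

# Toy stand-in for a HowNet-based word similarity (assumed values).
TOY_SIM = {frozenset(("computer", "laptop")): 0.9}


def toy_word_sim(a, b):
    """Return 1.0 for identical words, a table value or 0.0 otherwise."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def term_weights(doc, corpus_size, doc_freq):
    """Weight each term of one document.

    doc is a list of (word, pos_tag, in_title) tuples.
    doc_freq maps a word to the number of documents containing it.
    """
    counts = Counter(word for word, _, _ in doc)
    factors = {}
    for word, pos, in_title in doc:
        factor = POS_FACTOR.get(pos, POS_FACTOR["other"])
        if in_title:
            factor *= TITLE_BOOST
        # Keep the largest factor seen for this word.
        factors[word] = max(factors.get(word, 0.0), factor)
    weights = {}
    for word, tf in counts.items():
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((corpus_size + 1) / (doc_freq.get(word, 0) + 1)) + 1.0
        weights[word] = tf * idf * factors[word]
    return weights


def text_similarity(weights_a, weights_b, word_sim):
    """Symmetric, weight-averaged best-match similarity of two documents."""

    def directed(src, dst):
        total = sum(src.values())
        if total == 0 or not dst:
            return 0.0
        score = 0.0
        for word, weight in src.items():
            # Match each source word to its most similar target word.
            score += weight * max(word_sim(word, other) for other in dst)
        return score / total

    return 0.5 * (directed(weights_a, weights_b) + directed(weights_b, weights_a))


# Usage on two tiny, hand-tagged documents.
doc_a = [("computer", "noun", True), ("buy", "verb", False)]
doc_b = [("laptop", "noun", False), ("buy", "verb", False)]
doc_freq = {"computer": 1, "laptop": 1, "buy": 2}
w_a = term_weights(doc_a, corpus_size=2, doc_freq=doc_freq)
w_b = term_weights(doc_b, corpus_size=2, doc_freq=doc_freq)
print(text_similarity(w_a, w_b, toy_word_sim))
```

Because the similarity works on word pairs and not on shared vector dimensions, the two documents score as similar even though "computer" and "laptop" never co-occur. A pairwise measure like this could replace Euclidean or cosine distance in the assignment step of K-means, although the abstract does not say how the thesis combines them.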
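[Degree-granting institution]: Jiangsu University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1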

[Similar literature]