Research on Semantics-Based Text Clustering Algorithms

Published: 2018-04-09 17:44

Topic: text clustering  Focus: continuous word vectors  Source: Beijing Jiaotong University, master's thesis, 2017

[Abstract]: With the rapid development of information technology, the volume of online data is growing exponentially. How to pick out target information from massive online resources quickly and accurately has become an important problem. Text clustering, an important text mining technique that spans data mining, machine learning, and natural language processing, emerged against this background. The vector space model is widely used in text clustering research because it is simple and efficient. However, the traditional vector space model uses the words of a text directly as the features of its representation and ignores the semantic relations that may exist between words, which causes semantic information in the text to be lost. To address this problem, some researchers have proposed semantic disambiguation: mapping each word in a text to the WordNet concept that corresponds to its sense, so that ambiguous words and synonyms can be identified. Our analysis of these methods shows that their disambiguation strategies have some shortcomings. This thesis therefore proposes a semantic disambiguation algorithm based on continuous word vectors, which explores using a neural network language model to mine semantic similarity information between concepts and their contexts in depth, and thereby improves disambiguation accuracy. Applying this algorithm to text clustering yields a text clustering algorithm based on continuous-word-vector semantic disambiguation. Because the WordNet ontology contains a large amount of semantic knowledge organized in a structured form, a number of WordNet-based text representation methods that aim to enrich the semantic expression of text have been proposed and applied to text clustering. However, because the semantic information in text data is complex and diverse, and WordNet contains as many as 100,000 concepts, these methods generally suffer from excessively high-dimensional text vectors. To address this problem, this thesis proposes a feature dimensionality reduction algorithm based on concept clusters, which uses concept clustering to extract coarse-grained features from text and so reduces the dimensionality of the text representation. The hardest and most critical problem in this algorithm is how to obtain semantic representations of concepts for the subsequent concept clustering. Building on the effectiveness of neural network language models in semantic feature extraction, this thesis explores encoding the definitional (gloss) relations between WordNet concepts into a concept corpus and using a neural network language model to learn semantic representations of the concepts from their co-occurrence in that corpus. Combining the semantic disambiguation algorithm based on continuous word vectors with the feature dimensionality reduction algorithm based on concept clusters yields a text clustering algorithm based on continuous word vectors and concept clusters, which aims to improve clustering accuracy while also improving the efficiency of the clustering algorithm. Experimental comparison with several classical text clustering algorithms shows that the proposed algorithm not only effectively improves clustering accuracy but also resolves the high dimensionality of the text representation well.
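The two steps described in the abstract, disambiguating a word against its context and then collapsing concepts into concept clusters, can be illustrated with a minimal sketch. This is not the thesis's implementation. The abstract does not say how similarity is computed or how concepts are clustered, so the sketch assumes cosine similarity against the mean of the context word vectors and a precomputed concept-to-cluster mapping. All vectors, concept labels (such as `bank#finance`), and function names are hypothetical toy values rather than real WordNet identifiers or trained embeddings.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def disambiguate(word, context_words, word_vecs, candidates, concept_vecs):
    """Return the candidate concept closest to the mean context vector.

    Assumption: context is summarized by averaging word vectors; the thesis
    may use a different scoring scheme.
    """
    ctx = [word_vecs[w] for w in context_words if w != word and w in word_vecs]
    if not ctx:
        # No usable context: fall back to the first listed candidate.
        return candidates[word][0]
    ctx_vec = mean_vector(ctx)
    return max(candidates[word], key=lambda c: cosine(concept_vecs[c], ctx_vec))


def cluster_features(concepts, concept_to_cluster, n_clusters):
    """Count concept occurrences per cluster, giving an n_clusters-dimensional vector."""
    counts = Counter(concept_to_cluster[c] for c in concepts)
    return [counts.get(k, 0) for k in range(n_clusters)]


# Hypothetical 2-D embeddings: axis 0 is "finance", axis 1 is "nature".
word_vecs = {
    "loan": [0.9, 0.1],
    "money": [0.8, 0.2],
    "river": [0.1, 0.9],
    "water": [0.2, 0.8],
}
concept_vecs = {"bank#finance": [1.0, 0.0], "bank#river": [0.0, 1.0]}
candidates = {"bank": ["bank#finance", "bank#river"]}

# Hypothetical result of clustering concepts: cluster 0 = finance, 1 = nature.
concept_to_cluster = {"bank#finance": 0, "loan#finance": 0, "bank#river": 1}

sense_a = disambiguate("bank", ["money", "loan"], word_vecs, candidates, concept_vecs)
sense_b = disambiguate("bank", ["river", "water"], word_vecs, candidates, concept_vecs)
doc_vector = cluster_features(
    ["bank#finance", "loan#finance", "bank#river"], concept_to_cluster, 2
)
print(sense_a, sense_b, doc_vector)
```

In this toy setting a document is represented by one count per concept cluster instead of one per concept, which is the kind of coarse-grained feature extraction the abstract describes. In practice the concept vectors and the cluster assignments would come from a trained model and a clustering step, neither of which is shown here.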
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
