Short-Text Relatedness Computation Based on Wikipedia

Published: 2018-01-06 17:34

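Keywords: Short-Text Relatedness Computation Based on Wikipedia　Source: Taiyuan University of Technology, master's thesis, 2017　Document type: degree thesis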

【Abstract】: With the development of mobile communication technology and social media, information in the form of Chinese short texts has permeated every area of society and daily life. The enormous growth in information volume has also created enormous potential value, and how to mine the deeper value of these texts has become a hot topic. Natural language processing has therefore become a focus of research. Semantic relatedness computation is a fundamental task in natural language processing and is widely applied in query expansion, word sense disambiguation, machine translation, knowledge extraction, automatic error correction, and other areas. Short text, as an emerging source of textual information, contains few words, so the concepts it expresses give only weak signals and its feature information is fuzzy, which makes effective features hard to extract. Because the information a short text conveys is limited, a large amount of background knowledge is needed to expand the sample features. Wikipedia, currently the world's largest multilingual, open online encyclopedia, is favored by many researchers, so this thesis chooses Chinese Wikipedia as its external corpus; Wikipedia's structural and semantic information also provides a basis for the semantic analysis of short texts.

The thesis treats short text at two levels, words and sentences. It first proposes a Wikipedia-based method for computing relatedness between words. The method combines the structural and semantic information in Wikipedia, whose main structures include the category hierarchy, the link structure in article abstracts, the link structure in article bodies, and redirect and disambiguation pages, and it computes word relatedness by combining category relatedness with link relatedness. To explore the deeper semantic information of words, the thesis also proposes a method that uses association rules to compute word relatedness.

On this basis, the thesis proposes a method for computing relatedness between sentences that has three main components: relatedness of sentence structure, relatedness based on word pairs, and a clustering-based relatedness in which topic words are weighted by means of clustering. Sentence structure in turn covers two aspects, word form and word order. Word-form relatedness is reflected mainly by the frequency of word co-occurrence, and word-order relatedness is reflected by computing the inversion number. Word-pair relatedness mainly considers the deep semantic information of the words in a sentence and agrees better with human subjective judgment. Clustering groups semantically related words or texts into a class or cluster; the thesis applies it to sentence relatedness in order to improve accuracy.

With the theoretical method in place, the experimental scheme was designed. First, the Chinese Wikipedia corpus was downloaded and processed; second, word and sentence relatedness were computed; finally, the results were compared with manually annotated sets. For word relatedness, the test sets are a manually translated version of Word Similarity-353 and the Words-240 set compiled by the National University of Defense Technology. For sentence relatedness, the test set is the short-text semantic relatedness evaluation dataset provided by the China Database World Wide Web Knowledge Extraction Competition. Measured by the Spearman coefficient, accuracy, and related statistics, the proposed method improves the Spearman coefficient for word relatedness by 2.8% over traditional algorithms and reaches 73.3% accuracy for sentence relatedness. These results indicate that the proposed method is reasonable and practical.
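
The sentence-structure component described in the abstract can be illustrated with a minimal sketch, assuming sentences have already been segmented into token lists. The abstract states only that word-form relatedness comes from word co-occurrence and that word-order relatedness comes from the inversion number. The concrete formulas below (a Dice-style overlap ratio, one minus the normalized inversion count, and an equal-weight mix) are illustrative assumptions rather than the thesis's exact definitions, and every function name is hypothetical.

```python
from collections import Counter


def word_form_similarity(a, b):
    """Dice-style overlap of the two token multisets (assumed formula)."""
    if not a or not b:
        return 0.0
    common = sum((Counter(a) & Counter(b)).values())
    return 2.0 * common / (len(a) + len(b))


def inversion_count(seq):
    """Number of pairs (i, j) with i < j and seq[i] > seq[j]."""
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def word_order_similarity(a, b):
    """Compare the order of words that occur exactly once in both sentences."""
    count_a, count_b = Counter(a), Counter(b)
    shared = [w for w in a if count_a[w] == 1 and count_b[w] == 1]
    if not shared:
        return 0.0
    if len(shared) == 1:
        return 1.0
    # Positions in b of the shared words, listed in the order they appear in a.
    positions = [b.index(w) for w in shared]
    max_inversions = len(shared) * (len(shared) - 1) / 2
    return 1.0 - inversion_count(positions) / max_inversions


def structure_similarity(a, b, form_weight=0.5):
    """Weighted mix of both scores; the default weight is a placeholder."""
    return (form_weight * word_form_similarity(a, b)
            + (1.0 - form_weight) * word_order_similarity(a, b))


# Pre-segmented example sentences: "I / like / watching / films",
# with "films" moved to the front in the second sentence.
s1 = ["我", "喜欢", "看", "电影"]
s2 = ["电影", "我", "喜欢", "看"]
print(word_form_similarity(s1, s2))   # 1.0
print(word_order_similarity(s1, s2))  # 0.5
print(structure_similarity(s1, s2))   # 0.75
```

The two example sentences contain the same words, so the word-form score is 1.0, while moving one word to the front creates 3 of the 6 possible inversions and halves the word-order score. The quadratic inversion count is adequate for short texts; a merge-sort based count would be the usual choice for longer sequences.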

【Degree-granting institution】: Taiyuan University of Technology
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP391.1

【References】