Research and Implementation of Text Similarity Calculation Based on the HowNet (《知网》) Sememe Space

Published: 2018-02-10 20:37

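Keywords: text similarity; VSM; GVSM; semantic similarity; HowNet (《知网》); text duplicate-checking system　Source: Chongqing University, master's thesis, 2013　Type: degree thesis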

[Abstract]: Text similarity calculation is a core technology in fields such as intellectual property protection, text classification, machine translation, natural language processing, copy detection, automatic question answering, and information retrieval. Existing methods fall broadly into two categories: those based on statistics of text features, and those based on semantic understanding of the text. Statistical methods perform well when computing similarity over coarse-grained units such as long texts; the most representative are the Vector Space Model (VSM) and the Generalized Vector Space Model (GVSM). Building on VSM, GVSM uses co-occurrence information among text feature terms to relax VSM's assumption that feature terms are mutually orthogonal. Semantic methods usually rely on a knowledge base to compute similarity between words or between sentences. Statistical methods are simple and efficient but carry no semantics, so they cannot handle polysemy (one word with several meanings) or synonymy (one meaning expressed by several words) in natural language. Semantic methods, in turn, tend to be computationally expensive and are poorly suited to large-scale text processing. Drawing on the idea of GVSM and using the sememes of the HowNet (《知网》) knowledge base, this thesis proposes a text semantic similarity method based on the HowNet sememe space, called the Sememe Vector Space Model (SVSM). SVSM combines the statistical and semantic approaches: it represents a text as a vector in sememe space and computes text similarity from the angle between the sememe vectors of two texts. To validate the method, SVSM is compared with the classic VSM and GVSM models in text clustering experiments. The results indicate that the proposed algorithm improves on VSM and GVSM in semantic similarity calculation.

Building on this sememe-based similarity method, the thesis designs and implements a text duplicate-checking system on the J2EE platform. In this system, HowNet's sememes, concepts, and words, the similarities between sememes, and the sememe-vector representations of words are stored as relational tables in a database. Text similarity calculation can then obtain this information by direct table lookup, which avoids repeated computation and improves efficiency. The construction of text sememe vectors and the similarity calculation are implemented with open-source toolkits including Lucene, ICTCLAS, and Hibernate Search. The system has been applied in practical engineering projects, with good results.
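The idea of mapping texts into a sememe space and comparing them by angle can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the lexicon `WORD_SEMEMES`, its sememe labels and weights, and the helper names are hypothetical toy values rather than real HowNet data, and the sketch treats sememes as orthogonal axes, whereas the abstract indicates that the actual system also stores similarities between sememes.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (NOT real HowNet data): word -> weighted sememes.
WORD_SEMEMES = {
    "doctor": {"human": 1.0, "medical": 1.0},
    "physician": {"human": 1.0, "medical": 1.0},
    "hospital": {"place": 1.0, "medical": 1.0},
    "teacher": {"human": 1.0, "education": 1.0},
    "school": {"place": 1.0, "education": 1.0},
}


def term_vector(tokens):
    """Plain VSM representation: one axis per word, weighted by term frequency."""
    return Counter(tokens)


def sememe_vector(tokens):
    """Sememe-space representation: spread each word's frequency over its sememes."""
    vec = Counter()
    for word, tf in Counter(tokens).items():
        # Words missing from the toy lexicon contribute nothing.
        for sememe, weight in WORD_SEMEMES.get(word, {}).items():
            vec[sememe] += tf * weight
    return vec


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dict-like maps."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    a, b = ["doctor"], ["physician"]
    # The two texts share no word, so their term vectors are orthogonal.
    print("VSM :", cosine(term_vector(a), term_vector(b)))
    # Under the toy lexicon both words map to the same sememes.
    print("SVSM:", cosine(sememe_vector(a), sememe_vector(b)))
```

With this toy lexicon, two texts that use different words for the same meaning are orthogonal under the plain term-vector model but nearly parallel in sememe space, which is the synonymy case the abstract says purely statistical methods cannot handle.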
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.1
