Research on Graph-Based Document Retrieval Techniques

Published: 2018-08-23 14:29

[Abstract]: With the development of computer technology and the Internet, information retrieval has become an indispensable part of daily work and life, and it has drawn close attention from academia. In recent years the use of graph data has been on the rise: the growth of the Internet has been accompanied by the growth of big data, and more and more applications produce graph data, so research on graph data has also become very active. The main task of document retrieval is to compute the similarity between the query terms entered by the user and each document, and to return the documents ranked by that similarity. The vector space model is a basic model in information retrieval and the most commonly used model in document retrieval, and many popular document retrieval systems today still have it at their core. However, the vector space model treats terms as mutually independent during retrieval, which severs the relationships between them, whereas terms in real text are usually related. As a result, a retrieval system built on the vector space model can assign a high similarity score to a document whose content is only weakly related to the query, or even means the opposite. Graph data has been widely adopted in recent years largely because a graph makes relationships explicit through its nodes and edges. To address the problem above, this thesis proposes a graph-based document retrieval method: the query and the documents are both represented as graphs, and the similarity between a query and a document is quantified by computing the similarity between the query graph and the document graph.

First, drawing on dependency parsing and part-of-speech tagging from natural language processing, the thesis proposes a dependency-based graph model for text representation, which turns the query and the document text into graphs. To limit the cost of graph computation, it introduces the concept of a document semantic unit and builds graphs at the granularity of semantic units. Unlike earlier information retrieval approaches, which treat the query and the document as equivalent entities, this method places the query and the document on unequal levels. Second, drawing on graph theory, it proposes a graph similarity algorithm based on the generalized maximum common subgraph, which yields the similarity between the query graph and a text graph. Third, using the similarities between the query and each semantic unit obtained in the previous step, and considering that semantic units at different positions in a document may differ in importance, it proposes document scoring methods that compute the query-document similarity used to rank and return results. Finally, experiments on one Chinese and one English document collection analyze result quality under the different document scoring methods and compare the results with existing methods and techniques. The experiments show that the proposed method yields higher-quality document retrieval results.
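The contrast described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the dependency edges are written by hand rather than produced by a parser, `graph_similarity` is a simplified stand-in for the generalized maximum common subgraph (when every node is identified by its word, the common subgraph reduces to the intersection of nodes and edges), and `document_score` with its position weights is a hypothetical scoring rule chosen only for illustration.

```python
import math
from collections import Counter


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity of term-frequency vectors (terms treated as independent)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def graph_similarity(g1: set, g2: set) -> float:
    """Shared nodes plus shared edges, divided by the size of the larger graph.

    Each graph is a set of (head, relation, dependent) edges. This is an
    assumed simplification, not the generalized maximum common subgraph
    algorithm from the thesis.
    """
    def nodes(g):
        return {w for head, _, dep in g for w in (head, dep)}

    common = len(nodes(g1) & nodes(g2)) + len(g1 & g2)
    larger = max(len(nodes(g1)) + len(g1), len(nodes(g2)) + len(g2))
    return common / larger if larger else 0.0


def document_score(unit_sims, weights) -> float:
    """Hypothetical scoring rule: position-weighted average of unit similarities."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, unit_sims)) / total if total else 0.0


# Hand-written dependency edges (illustrative labels); a real system would
# obtain them from a dependency parser.
query = {("bit", "nsubj", "dog"), ("bit", "obj", "man")}
unit_same = {("bit", "nsubj", "dog"), ("bit", "obj", "man")}
unit_reversed = {("bit", "nsubj", "man"), ("bit", "obj", "dog")}

# Bag-of-words cannot tell the two sentences apart.
bow = cosine_bow("dog bit man", "man bit dog")

# The graph comparison separates them because the edges differ.
sim_same = graph_similarity(query, unit_same)
sim_reversed = graph_similarity(query, unit_reversed)

# Treat the two units as one document, weighting the first unit more heavily.
score = document_score([sim_reversed, sim_same], weights=[2.0, 1.0])
```

In this sketch the bag-of-words cosine for "dog bit man" and "man bit dog" is essentially 1, while the graph comparison gives the reversed sentence a lower similarity than the identical one, which is the kind of distinction the abstract attributes to a graph representation.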
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.3
