A Collection Selection Method for Distributed Information Retrieval Based on a Graph Knowledge Base

Published: 2017-12-28 22:07

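Keywords: A Collection Selection Method for Distributed Information Retrieval Based on a Graph Knowledge Base　Source: Zhejiang University, 2017 master's thesis　Document type: degree thesis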

[Abstract]: Collection selection aims to pick a small number of information collections to search, and it is essential to the efficiency of an information retrieval engine. Most current collection selection methods use a centralized sample set as the description of each collection. However, these methods model a collection using only the "morphosyntactic" information of its sample documents, so they cannot accurately represent the collection's semantics. This thesis therefore proposes a collection selection method based on a graph knowledge base (KBCS), which represents the semantic information of a collection with weighted entity words. First, based on the DBpedia graph knowledge base, contextual relatedness and structural relatedness are used to compute the semantic distance between every pair of entity words in a collection's sample documents, and the weight of each entity word in the collection is then measured. Next, the relevance between a query and a collection is computed by jointly considering collection size, collection entity-word weights, query entity-word weights, and entity-word frequency. Finally, the collections are ranked by relevance score and the top-ranked ones are selected. In addition, because original queries often contain few entity words, a DBpedia-based query expansion method is integrated. To address the limitations of traditional query-collection relevance measures, the LambdaMART learning-to-rank algorithm is used to combine the outputs of several relevance measures and learn a collection ranking model. To evaluate KBCS, the thesis uses ReDDE, CRCS, and DLCS as baseline methods and compares them experimentally on a large-scale web dataset. The experimental results show that the proposed method has a significant performance advantage.
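The ranking step described in the abstract (score each collection against the query, sort, keep the top few) can be illustrated with a minimal Python sketch. The abstract does not give the actual KBCS scoring formula, so the one below is an assumption made only for illustration: it multiplies the query entity weight, the collection entity weight, and a log-damped entity frequency, then scales by a log-damped collection size. The names `Collection`, `relevance`, and `select_collections` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Collection:
    name: str
    size: int                         # estimated number of documents
    entity_weights: Dict[str, float]  # entity word -> weight in this collection
    entity_freqs: Dict[str, int]      # entity word -> frequency in sample documents


def relevance(query_weights: Dict[str, float], coll: Collection) -> float:
    """Illustrative query-collection relevance (NOT the thesis's formula)."""
    total = 0.0
    for entity, q_weight in query_weights.items():
        c_weight = coll.entity_weights.get(entity, 0.0)
        freq = coll.entity_freqs.get(entity, 0)
        # Assumed combination: product of both weights and damped frequency.
        total += q_weight * c_weight * math.log1p(freq)
    # Assumed size factor: larger collections get a mild boost.
    return total * math.log1p(coll.size)


def select_collections(query_weights: Dict[str, float],
                       collections: List[Collection],
                       k: int) -> List[str]:
    """Rank collections by relevance and return the names of the top k."""
    ranked = sorted(collections,
                    key=lambda c: relevance(query_weights, c),
                    reverse=True)
    return [c.name for c in ranked[:k]]
```

In this sketch a collection that shares no entity words with the query scores zero, which is one reason the abstract's DBpedia-based query expansion matters when the original query contains few entity words.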
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3
