A Document Ranking Method Supporting Implicit Temporal Queries in Information Retrieval

Published: 2018-04-11 00:09

Topic of this paper: temporal information retrieval + query temporal intent; Reference: Jiangsu University, 2017 master's thesis

[Abstract]: The spread of the Internet has brought explosive growth in information resources. This gives users more choices but also makes useful information harder to find, so using search engines to select documents that meet users' needs from massive amounts of information has become an important challenge. In recent years, the number of web pages and queries containing temporal information has kept growing, and temporal information retrieval (TIR) has become a focus of research. TIR studies how to extract temporal information from web pages, analyze the temporal intent of queries, and build time-aware ranking models in order to improve the retrieval quality of search engines. Queries with temporal intent fall into two types. An explicit temporal query contains a temporal expression that states the time constraint directly. An implicit temporal query provides no explicit time criterion, but its temporal intent lies within a specific time interval. According to statistics, more than 7% of Internet queries carry implicit temporal intent, while about 1.5% contain explicit time constraints. Implicit temporal queries therefore account for the larger share of Internet queries, and more research on them remains to be done. This thesis studies how to analyze the temporal intent of implicit temporal queries and how to improve retrieval performance for them. The main contributions are as follows. (1) For implicit temporal queries, a method is proposed that analyzes query temporal intent by combining the semantic web resource DBpedia with the top-k ranked documents. If the query concerns a famous person or a major historical event, the specific time interval obtained by querying DBpedia (a semantic web resource built on Wikipedia) is taken as the temporal intent of the query. For other types of queries, the temporal intent is derived from the temporal expressions that occur most frequently in the content of the top-k ranked documents. (2) Building on the language model, a document ranking model that supports implicit temporal queries is proposed. It takes temporal uncertainty into account, computes the probability that each document generates the query as the document's temporal relevance score, and finally re-ranks the documents by a linear combination of the temporal relevance score and the content relevance score. (3) The document collection from the Temporal Information Access (Temporalia) task of NTCIR-11 is used as experimental data to evaluate the proposed temporal intent analysis method and document ranking model. The intent analysis method is first compared with several previously proposed methods. The results show that analyzing the temporal intent of a query before computing document relevance scores is worthwhile, and that the proposed combination of DBpedia and the top-k ranked documents analyzes query temporal intent well. Given the derived temporal intent, the proposed ranking method is then compared with existing time-aware ranking methods. Most metric values of the time-aware ranking models are higher than those of the initial ranking, which considers content relevance only, indicating that accounting for temporal relevance in the retrieval model helps improve retrieval quality. Compared with the other ranking methods, the proposed language-model-based ranking method performs better.
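The two steps described in the abstract, deriving a query's temporal intent from the top-k documents and then re-ranking by a linear combination of temporal and content relevance, can be illustrated with a minimal Python sketch. It is not the thesis's model: the function names (`estimate_intent`, `temporal_score`, `rerank`), the parameters `k`, `decay`, and `lam`, the four-digit-year regular expression, and the exponential-decay scoring are all assumptions made for illustration. The exponential decay only stands in for the language-model probability with temporal uncertainty, and the DBpedia lookup for famous people and major events is omitted.

```python
import math
import re
from collections import Counter

# Assumption: temporal expressions are approximated by four-digit years.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def estimate_intent(top_k_texts):
    """Return the most frequent year in the top-k documents as an interval."""
    counts = Counter(
        int(y) for text in top_k_texts for y in YEAR_RE.findall(text)
    )
    if not counts:
        return None  # no temporal expression found
    year, _ = counts.most_common(1)[0]
    return (year, year)


def temporal_score(doc_text, interval, decay=0.5):
    """Average closeness of the document's years to the intent interval.

    Stand-in for a temporal relevance probability: each year contributes
    exp(-decay * distance), where distance is 0 inside the interval.
    """
    years = [int(y) for y in YEAR_RE.findall(doc_text)]
    if not years or interval is None:
        return 0.0
    start, end = interval

    def distance(year):
        if year < start:
            return start - year
        if year > end:
            return year - end
        return 0

    return sum(math.exp(-decay * distance(y)) for y in years) / len(years)


def rerank(docs, content_scores, k=3, lam=0.4):
    """Re-rank documents given in initial ranking order.

    The final score is (1 - lam) * content + lam * temporal.
    Returns the estimated interval and the new order as document indices.
    """
    interval = estimate_intent(docs[:k])
    combined = [
        (1 - lam) * c + lam * temporal_score(d, interval)
        for d, c in zip(docs, content_scores)
    ]
    order = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return interval, order


docs = [
    "Olympic history overview from 1896 onward.",
    "The 2008 Beijing Olympics opened in August 2008.",
    "Beijing hosted the Games in 2008.",
    "Medal table of the 2008 Olympics, compared with 2012.",
]
content_scores = [0.90, 0.80, 0.70, 0.65]  # hypothetical, normalized to [0, 1]
interval, order = rerank(docs, content_scores)
print(interval, order)
```

In this toy run the year 2008 dominates the top three documents, so the documents that mention 2008 move ahead of the general history page even though its content score is the highest. The weight `lam` controls how strongly temporal relevance can override content relevance.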
[Degree-granting institution]: Jiangsu University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3

[References]