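Research on Query Expansion Based on Query Logs
Published: 2018-06-17 22:56
Topics: query expansion + query log analysis; Source: Beijing University of Posts and Telecommunications, 2013 master's thesis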
[Abstract]: The Internet now reaches into every corner of daily life. The volume of information online keeps growing, and it is growing ever faster. Faced with this mass of information, how to retrieve what people actually need has become a central topic in information retrieval. Mainstream search engines still rely on keyword matching, and with so much information, keyword matching alone rarely returns results that satisfy users. Query expansion techniques emerged in response and have since made some progress. Building on an analysis of the shortcomings of earlier algorithms, this thesis combines the idea of crowdsourcing with user query logs and proposes a crowdsourcing-based query expansion algorithm. Experiments show that the new algorithm clearly improves retrieval effectiveness. The main work of the thesis is as follows:
First, the thesis introduces the research background of query expansion, surveys its development, and briefly describes the research and work carried out here. Second, it introduces the theory of information retrieval and query expansion, studies the current mainstream query expansion algorithms in detail, and analyzes their strengths and weaknesses. Third, it briefly introduces crowdsourcing and the principle of the algorithm used to implement it, the Expectation Maximization (EM) algorithm, and improves that algorithm in preparation for combining crowdsourcing with user query logs.
The thesis carries out a detailed statistical analysis of user query logs, covering the characteristics of user query terms, the characteristics of sessions during the query process, and user click behavior. These analyses are both the motivation for query expansion and its foundation.
Using a dataset provided by Sogou, the thesis preprocesses the data and then uses the Indri search engine to build a simple search platform matched to the user query logs, on which the experiments are run.
The thesis proposes a crowdsourcing-based query expansion algorithm. It introduces crowdsourcing into query expansion and, based on the characteristics of user query logs, recasts the user's query process as a crowdsourcing process. It then uses the improved EM algorithm to rerank the relevant documents and selects expansion terms from the reranked documents. Experiments on the self-built search platform show that, compared with some traditional query expansion algorithms, the proposed algorithm clearly improves retrieval effectiveness as measured by P@20.
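The abstract reports its results in terms of P@20, that is, precision over the top 20 retrieved documents. As a minimal sketch of how that metric is commonly computed (not the thesis's own evaluation code), the function below counts how many of the top k ranked documents are judged relevant. The function name `precision_at_k`, the document IDs, and the relevance set are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked_doc_ids: Sequence[str],
                   relevant_doc_ids: Set[str],
                   k: int = 20) -> float:
    """Return the fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Divide by k rather than len(top_k): a result list shorter than k
    # is treated as if the missing positions were non-relevant.
    return hits / k


# Hypothetical ranking and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

p_at_4 = precision_at_k(ranked, relevant, k=4)   # 2 relevant among the top 4
p_at_20 = precision_at_k(ranked, relevant)       # same 2 hits, divided by 20
```

To compare an expanded query against a baseline in this way, one would compute the value for each test query under both systems and average over queries. How the thesis obtained its relevance judgments is not stated in the abstract.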
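[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3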