Improving Search Engine Performance Based on Understanding Long-Tail Query Needs
[Abstract]: Search engines are an important tool for obtaining information. To search for a target, a user must formulate a query. Query frequencies follow a power-law distribution, and we call the queries in the tail of that distribution long-tail queries. Analysis of real search engine data shows that long-tail queries account for about 70% of all unique queries and that almost all users issue long-tail queries. However, user behavior data for long-tail queries is sparse, so existing search quality optimization methods are difficult to apply directly, which makes these queries a hard problem for search engines. By sampling and analyzing real search engine logs, we find that some long-tail queries fail to retrieve correct results because the query is poorly expressed, not because the web lacks resources that meet the user's need. To address this, we try to understand the user's query need by analyzing query rewriting behavior, help the user find an appropriate query expression, and directly optimize the query results. The main contributions of this thesis are as follows:
1. Analysis and prediction of query rewriting behavior patterns. Building on previous research, query rewriting behavior is divided into four types: New Topic, Generalization, Specification, and Parallel (parallel topic). Based on an analysis of sampled real search engine logs, methods for predicting and classifying query rewriting behavior patterns are proposed. The overall accuracy reaches 79.29%, which provides a basis for further understanding user needs.
2. Automatic evaluation of the relevance of long-tail query results. The thesis analyzes the relationship between the relevance of long-tail query result documents and how they are displayed and clicked, extracts click features, term-highlighting (red-text) features, and search engine ranking features, and trains a classifier based on ensemble learning. The classifier achieves good results in predicting result relevance.
3. A long-tail query performance improvement method based on multi-result fusion. By mining candidate rewrites of a long-tail query, we find queries with similar intent but a more appropriate expression. The results of these rewrites are then fused with the results of the original query, so the long-tail query is improved directly at the level of the result list. The approach introduces new results rather than only reordering existing ones. During ranking, information is added that reflects whether the original query can be improved. Experiments on real search engine data show that the method significantly improves ERR@10 by 3.69%. The method is also effective for non-long-tail queries.
4. A long-tail query performance improvement system based on understanding user intent. The system combines the prediction of query rewriting behavior with multi-result fusion, introduces personalized information about individual users, and adds new result documents in a targeted way, which further increases the performance gain.
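The abstract reports its gain in ERR@10 without defining the metric. As a minimal sketch, the standard Expected Reciprocal Rank definition can be computed as below. The 0 to 4 relevance grade scale, the function name `err_at_k`, and the two example grade lists are assumptions made for illustration; they are not taken from the thesis.

```python
from typing import Sequence


def err_at_k(grades: Sequence[int], k: int = 10, max_grade: int = 4) -> float:
    """Expected Reciprocal Rank over the top-k results.

    grades[i] is the relevance grade of the result at rank i + 1.
    The 0..max_grade scale is an assumption, not taken from the thesis.
    """
    err = 0.0
    p_reach = 1.0  # probability the user reaches this rank still unsatisfied
    for rank, grade in enumerate(grades[:k], start=1):
        # Probability that a result with this grade satisfies the user.
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        err += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return err


# Hypothetical grades: the original ranking vs. a list fused with rewrite results.
original = [0, 1, 0, 2, 0]
fused = [2, 0, 1, 0, 0]
print(err_at_k(original), err_at_k(fused))
```

Because each result is discounted both by its rank and by the chance that an earlier result already satisfied the user, moving a relevant document toward the top of the list (as result fusion aims to do) raises the score.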
[Degree-granting institution]: Tsinghua University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.3
[Author]: Huo Shuai (霍帅)
Article No.: 2319884
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2319884.html