
Improving Search Engine Performance Based on Understanding Long-Tail Query Needs

Published: 2018-11-09 08:27
[Abstract]: Search engines are an important tool for obtaining information. To look for a target in a search engine, a user must formulate a query. Query frequency follows a power-law distribution, and we call the queries at the tail of that distribution long-tail queries. Analysis of real search engine data shows that long-tail queries account for about 70% of all distinct queries, and that almost every user issues long-tail queries. However, user behavior data for long-tail queries is sparse, so existing methods for optimizing retrieval quality are hard to apply directly, which makes long-tail queries a difficult problem for search engines. Through a sampled analysis of real search engine logs, we find that a considerable share of long-tail queries fail to retrieve correct results because they are poorly phrased, not because the web lacks resources that satisfy the user's need. To address this, we analyze how users reformulate their queries in order to understand their query needs, help users find a suitable way to phrase the query, and directly optimize the query results. The main contributions of this thesis are as follows:
1. Analysis and prediction of query reformulation behavior patterns. Building on prior work, we divide query reformulation patterns into four types: New Topic, Generalization, Specification, and Parallel (parallel topic). Based on an analysis of sampled real search engine logs, we propose methods for predicting and classifying reformulation patterns. Overall accuracy reaches 79.29%, which lays a foundation for further understanding user needs.
2. Automatic evaluation of the relevance of long-tail query results. We analyze how the relevance of long-tail query result documents relates to how those documents are displayed and clicked. We extract click features, red-highlighting features, and search engine ranking features, and train a classifier based on ensemble learning, which performs well at predicting result relevance.
3. A multi-result fusion method for improving long-tail query performance. By mining candidate reformulations of a long-tail query, we look for queries that have similar intent but are phrased more appropriately. The results of these reformulated queries are then merged and ranked together with the results of the original query, so the long-tail query is improved directly at the level of the result list. Our method introduces new results rather than merely reranking existing ones. The ranking process incorporates information indicating whether the original query can be improved. Experiments on real search engine data show that the method yields a significant improvement of 3.69% in ERR@10. Notably, the method is also effective for improving the performance of non-long-tail queries.
4. A long-tail query performance improvement system based on understanding user intent. The system combines the prediction of query reformulation behavior with the multi-result fusion method and brings in personalized information about the individual user, so that new result documents are introduced in a targeted way. This further increases the performance gain.
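The abstract reports its main result in ERR@10 (Expected Reciprocal Rank over the top 10 results). As a reading aid, the following is a minimal sketch of how that metric is commonly computed. It is not taken from the thesis: the function name `err_at_k`, the 0 to 4 relevance grade scale, and the sample grade lists are assumptions chosen for illustration.

```python
from typing import Sequence


def err_at_k(grades: Sequence[int], k: int = 10, max_grade: int = 4) -> float:
    """Expected Reciprocal Rank over the top-k results.

    grades: relevance grade of each result in ranked order.
    max_grade: highest possible grade (a 0..4 scale is assumed here;
               the thesis abstract does not state its grading scale).
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades[:k], start=1):
        # Probability that the result at this rank satisfies the user.
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_satisfied / rank
        # The user continues to the next rank only if not satisfied here.
        p_reach *= 1.0 - p_satisfied
    return err


if __name__ == "__main__":
    # Hypothetical grade lists, for illustration only (not thesis data).
    original = [0, 1, 0, 2, 0, 0, 1, 0, 0, 0]
    fused = [2, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    before, after = err_at_k(original), err_at_k(fused)
    print(f"ERR@10 before: {before:.4f}, after: {after:.4f}")
    print(f"relative change: {(after - before) / before:.2%}")
```

Because ERR discounts each result by the probability that an earlier result already satisfied the user, moving or adding a relevant document near the top of the list changes the score much more than the same change further down.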
[Degree-granting institution]: Tsinghua University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.3

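This thesis: Huo Shuai (霍帅); Improving Search Engine Performance Based on Understanding Long-Tail Query Needs [D]; Tsinghua University; 2015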




Article ID: 2319884


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2319884.html

