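Research and Implementation of Result Fusion Methods for Distributed Search
Published: 2018-06-28 03:47
Topics: distributed search engine + federated retrieval; Source: South China University of Technology, master's thesis, 2013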
[Abstract]: With the rapid development of the Internet, the number of web pages and the richness of information are growing quickly, and information resources are increasingly distributed in both location and presentation. This poses many challenges for traditional centralized search engines, especially in system scalability, in retrieving the "deep Web", and in diversifying search results. A distributed search engine is therefore a suitable way to adapt to the structure of next-generation network information and its likely development. Built on a scalable distributed architecture, a distributed search engine can make effective use of distributed resources, combine diverse information sources, and offer users a more comprehensive and accurate retrieval service.
This work originates from the national next-generation Internet (CNGI) project "Next-Generation Internet Distributed Search Engine". It studies the federated retrieval system of a distributed search engine platform. The system automatically distributes each query to independent search engines (unit search engines) and fuses the results they return, giving the user a single, optimized ranking. The core techniques of the federated retrieval system are query distribution and result fusion. The main research content is choosing a suitable query distribution strategy and using the distribution decision to produce an optimized fused ranking of the retrieved results.
Based on the characteristics of a real data set from a campus network, the thesis mines static and dynamic resource features of the unit search engines and uses a resource score to measure how relevant each unit search engine is to the query terms. It proposes a query distribution strategy based on resource scoring, which sends the query only to the unit search engines most relevant to the query terms and thereby safeguards the quality of the returned results. Building on this strategy, it proposes a result fusion and ranking algorithm with three parts: normalizing the result document scores, a fusion algorithm designed around the resource scores used for query distribution, and a fusion mechanism that reinforces result diversity. Experiments show that the proposed query distribution strategy and result fusion algorithm improve the precision of the system while keeping the displayed results diverse, meeting users' need to query from multiple angles.
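The abstract names three steps (resource-score-based query distribution, document score normalization, and fusion weighted by resource score) but gives no formulas. The Python sketch below is one minimal way these steps could fit together, not the thesis's actual algorithm. The function names, the use of min-max normalization, the top-k engine selection, and the rule of keeping the best weighted score for a duplicated document are all assumptions made for illustration. The diversity mechanism is not covered.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (document id, raw score from one unit search engine)


def select_engines(resource_scores: Dict[str, float], top_k: int) -> List[str]:
    """Assumed distribution rule: send the query to the top_k engines by resource score."""
    ranked = sorted(resource_scores, key=lambda name: (-resource_scores[name], name))
    return ranked[:top_k]


def min_max_normalize(results: List[Result]) -> List[Result]:
    """Rescale one engine's raw scores to [0, 1] so engines become comparable."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]


def fuse(results_by_engine: Dict[str, List[Result]],
         resource_scores: Dict[str, float]) -> List[Result]:
    """Weight each normalized score by its engine's resource score and merge.

    Assumption: a document returned by several engines keeps its best weighted score.
    """
    fused: Dict[str, float] = {}
    for engine, results in results_by_engine.items():
        weight = resource_scores[engine]
        for doc, norm_score in min_max_normalize(results):
            fused[doc] = max(fused.get(doc, 0.0), weight * norm_score)
    # Highest fused score first; ties broken by document id for a stable order.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical resource scores and engine results, for illustration only.
resource_scores = {"library": 0.9, "news": 0.6, "forum": 0.1}
chosen = select_engines(resource_scores, top_k=2)
results_by_engine = {
    "library": [("a", 10.0), ("b", 5.0), ("c", 0.0)],
    "news": [("d", 0.8), ("a", 0.2)],
}
ranking = fuse({name: results_by_engine[name] for name in chosen}, resource_scores)
```

In this sketch an engine with a low resource score can never outrank the best document of a high-scoring engine, which reflects the abstract's idea that the distribution decision should also shape the fused ranking.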
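[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3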
Article ID: 2076604
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2076604.html