Distribution Scheduling and Fusion Ranking in a Distributed Search Engine System

Published: 2018-12-31 20:50

[Abstract]: With the development of the Internet, the volume of web information has grown explosively. With limited funding and equipment, many retrieval systems can only index and search resources in a single field or on a single topic, and can hardly cover the whole web. Distributed information retrieval offers one solution: as a distributed architecture, it can make effective use of scattered, idle resources to provide retrieval services. Distributed information retrieval mainly refers to the process of retrieving information useful to the user from a large number of heterogeneous information resources in a distributed environment, using techniques such as distributed computing and mobile agents. Because different information resources have different data storage structures and retrieval strategies, the key technical problems of a distributed search system include: how to describe the content of each resource and select resource nodes by comparing those descriptions with the query, that is, the problem of query distribution and node scheduling; and how to merge the document lists returned by different resource nodes, that is, the problem of fusing and ranking retrieval results. This thesis describes the design and implementation details of the "Next Generation Internet Distributed Search Engine System", studies the distribution scheduling and fusion problems on the basis of this system, and gives solutions that are implemented in the system. For the distribution scheduling strategy, the thesis first obtains resource description information in two ways, feature words and random plus high-frequency word sampling, and then scores and selects resources by combining the resource descriptions with historical retrieval information. For the fusion ranking strategy, the thesis proposes a similarity principle and a diversity principle based on application requirements, and combines the two into a fusion ranking strategy whose emphasis differs from that of earlier algorithms. Both strategies were evaluated experimentally on the system, with comparative data and analysis of the system before and after the strategies were applied. The results show that the proposed distribution scheduling and fusion ranking strategies improve both the recall and the precision of retrieval results while preserving their diversity.
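
The abstract names the two strategies but gives no formulas, so the following Python sketch is only an illustration of the general idea, not the method of the thesis. Every function name, the weight `alpha`, the `penalty` parameter, and the scoring formulas are assumptions made for this example. It scores each resource node by combining how well a sampled term description matches the query with a historical hit rate, and it merges result lists greedily so that a highly similar document is preferred (similarity) while a node that has already contributed many results is discounted (diversity). Similarity scores are assumed to be comparable across nodes.

```python
from collections import Counter


def score_resource(query_terms, description, history_hit_rate, alpha=0.7):
    """Score one resource node (hypothetical formula).

    description: mapping term -> frequency, e.g. built from sampled documents.
    history_hit_rate: assumed value in [0, 1] summarising past usefulness.
    alpha: assumed weight of the description match versus history.
    """
    total = sum(description.values())
    match = 0.0
    if total > 0:
        match = sum(description.get(t, 0) for t in query_terms) / total
    return alpha * match + (1 - alpha) * history_hit_rate


def select_resources(query_terms, nodes, k=2):
    """Return the names of the k best-scoring nodes.

    nodes: mapping name -> (description, history_hit_rate).
    """
    ranked = sorted(
        nodes,
        key=lambda name: score_resource(query_terms, *nodes[name]),
        reverse=True,
    )
    return ranked[:k]


def merge_results(results_by_node, penalty=0.2):
    """Greedy merge balancing similarity and source diversity.

    results_by_node: mapping node -> list of (doc_id, similarity),
    each list sorted by similarity in descending order.
    At each step the head document with the highest
    similarity - penalty * (documents already taken from its node) wins.
    """
    heads = {node: 0 for node in results_by_node}
    taken = Counter()
    merged = []
    while True:
        best = None
        for node, docs in results_by_node.items():
            i = heads[node]
            if i < len(docs):
                doc_id, sim = docs[i]
                adjusted = sim - penalty * taken[node]
                if best is None or adjusted > best[0]:
                    best = (adjusted, node, doc_id)
        if best is None:
            break  # every list is exhausted
        _, node, doc_id = best
        merged.append(doc_id)
        heads[node] += 1
        taken[node] += 1
    return merged
```

With `penalty=0` the merge reduces to ordering purely by similarity; raising `penalty` interleaves results from more nodes at the cost of strict similarity order.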
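[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3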
