Research on Several Important Problems in Distributed Information Retrieval

Published: 2018-05-27 23:05

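Topic: distributed information retrieval + information retrieval; Reference: doctoral dissertation, Beijing University of Posts and Telecommunications, 2012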

[Abstract]: Distributed information retrieval is an important research area within information retrieval, and a growing number of retrieval systems draw on its theory and techniques. For example, web search must integrate the results returned by multiple vertical search engines, cross-language retrieval cannot avoid ranking documents in different languages by relevance, and professional patent search may need to query several patent databases at once. Research has also shown that, under certain conditions, distributed retrieval outperforms traditional retrieval. Distributed information retrieval comprises the techniques and methods for querying multiple document databases simultaneously. Specifically, on receiving a user's query, the retrieval system first selects document databases by relevance, forwards the query to the selected databases, collects the results they return, and finally merges them into a single list for the user. There are three key problems: how to describe a document database (database description), how to select suitable databases for a given query (database selection), and how to merge the returned results (result merging).
After a thorough survey, this thesis studies several important problems in distributed information retrieval in detail and obtains a number of novel results. The main contributions are as follows:
1. For the database description problem, this thesis verifies the reliability, stability, and necessity of query-based sampling in a Chinese-language environment.
Query-based sampling in uncooperative environments is a focus of current research, but previous work was carried out on standard English data sets, and no study had specifically confirmed that the technique is reliable and effective for Chinese. After examining the assumptions and basic theory of query-based sampling, this thesis verifies its reliability and effectiveness for Chinese from a practical standpoint, through well-structured and logically clear experiments that, following the retrieval pipeline, test the database description level, the database selection level, and the retrieval level. A broad series of experiments shows that query-based sampling is feasible and efficient for Chinese; the results at the database description level in particular demonstrate the reliability, stability, and necessity of the sampling technique.
2. For the database selection problem, this thesis proposes a selection algorithm based on a discriminative model and a selection algorithm based on topic clustering, and verifies their effectiveness.
There is already a large body of work in this area, which can be roughly divided into term-frequency-based, document-based, and classification/clustering-based selection methods. Viewed in terms of the distinction between discriminative and generative models, the work here has two parts. First, taking into account the information shared between different databases, we propose a selection algorithm based on a discriminative model. Second, taking into account the semantics of the databases, we propose, on theoretical grounds, a selection algorithm based on topic clustering. The former is treated as a theoretical discussion. The latter is the focus of our work, because the topic clustering algorithm considers not only document-level factors but also the semantics of the databases, which makes the model clearly interpretable. We also give a unified analysis and explanation of this class of models from the perspective of probabilistic graphical models. Experiments confirm that the topic-clustering-based selection algorithm is highly competitive on existing data sets.
3. For the result merging problem, this thesis formulates a weighted curve fitting algorithm and shows that it yields a clear and stable improvement over existing algorithms.
The classical algorithms in result merging are the CORI merging algorithm (CORI Merging), the SSL algorithm (Semi-Supervised Learning), and the SAFE algorithm (Sample-Agglomerate Fitting Estimate). SSL addresses the instability of CORI merging in uncooperative environments, and SAFE addresses SSL's shortage of training samples. SAFE, however, has two shortcomings in how it uses documents: it does not account for the differing importance of documents at different ranks, and it does not account for the estimation error in document ranks. To address both, this thesis builds on SAFE and proposes the weighted curve fitting algorithm (Weighted Curve Fitting, or WCF). Extensive experiments show that WCF's advantage over SAFE is consistent and stable. For certain settings, we also give parameter combinations under which WCF is likely to perform best.
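The general idea behind regression-based result merging with rank-dependent weights can be illustrated with a minimal sketch. This is not the thesis's WCF algorithm: the abstract does not give its weighting function or curve family, so the linear curve, the `1/rank` weights, and the names `weighted_linear_fit`, `merge_results`, and `central_scores` are all assumptions made for illustration. Each database's own scores are mapped onto a common scale by fitting a line to the documents that also have a score from a centralized sample index, with higher-ranked documents counting more in the fit.

```python
from typing import Dict, List, Tuple


def weighted_linear_fit(points: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Fit y = a*x + b by weighted least squares over (x, y, weight) triples."""
    total_w = sum(w for _, _, w in points)
    if total_w <= 0:
        raise ValueError("need at least one point with positive weight")
    mean_x = sum(w * x for x, _, w in points) / total_w
    mean_y = sum(w * y for _, y, w in points) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in points)
    if sxx == 0:
        raise ValueError("need at least two distinct x values with positive weight")
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def merge_results(
    results: Dict[str, List[Tuple[str, float]]],
    central_scores: Dict[str, float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Merge per-database ranked lists into one list on a common score scale.

    results maps a database name to its ranked list of (doc_id, db_score).
    central_scores maps doc_id to a score from a centralized sample index
    (hypothetical input; only some returned documents are expected to have one).
    """
    merged: List[Tuple[str, float]] = []
    for ranked in results.values():
        # Training points: returned documents that also have a centralized score.
        # Assumed weighting: 1/rank, so top-ranked documents dominate the fit.
        points = [
            (score, central_scores[doc], 1.0 / rank)
            for rank, (doc, score) in enumerate(ranked, start=1)
            if doc in central_scores
        ]
        a, b = weighted_linear_fit(points)
        merged.extend((doc, a * score + b) for doc, score in ranked)
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:top_k]


if __name__ == "__main__":
    results = {
        "db_a": [("a1", 9.0), ("a2", 7.0), ("a3", 5.0)],
        "db_b": [("b1", 0.8), ("b2", 0.45), ("b3", 0.2)],
    }
    central_scores = {"a1": 0.9, "a3": 0.5, "b1": 0.6, "b3": 0.3}
    for doc, score in merge_results(results, central_scores):
        print(doc, round(score, 3))
```

In this toy input the two databases score on different scales (roughly 0 to 10 versus 0 to 1), which is why raw scores cannot be compared directly and a per-database mapping is fitted first. A database with fewer than two usable training documents raises `ValueError` in this sketch; a real system would need a fallback.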
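[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Doctoral
[Year conferred]: 2012
[Classification number]: TP391.3;TP311.13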
