A Study of Relevance Feedback Algorithms in Information Retrieval

Published: 2018-10-14 12:25

[Abstract]: Information retrieval is the field concerned with the structure, analysis, organization, storage, searching, and retrieval of information. Broadly, it is the task of finding, in an unstructured collection, the information relevant to a user's need. A core concern of information retrieval is the user and the user's information need, because search is evaluated from the user's point of view. This view has prompted a large body of research on how people interact with search engines, and in particular on techniques that help users express their information needs. In interactive retrieval, the user submits a short query, the system returns an initial result list, and the user marks some of the results as relevant or non-relevant. Based on this feedback, the system computes a better query to represent the information need and returns a new set of results that is more likely to satisfy the user. This process is called relevance feedback. Using relevance feedback during retrieval can improve the query results and the efficiency of querying. Starting from a survey of the current state of relevance feedback, this thesis presents relevance feedback algorithms for the vector space model, the probabilistic model, and the Boolean model. The main focus is the Rocchio relevance feedback algorithm for the vector space model. The thesis describes the idea and procedure of the algorithm in detail, along with two situations in which it performs poorly: queries whose answer set must itself be composed of documents from different classes, and terms that typically stand for a disjunction (an OR relationship) of several specific concepts. The Rocchio algorithm is improved for these two cases so that it also returns good results there.
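As a point of reference for the Rocchio algorithm named above, the following is a minimal sketch of the textbook Rocchio update: the new query is the old query pulled toward the centroid of the documents marked relevant and pushed away from the centroid of those marked non-relevant. It is not the thesis's improved variant. The function names `centroid` and `rocchio`, the sparse dictionary representation, and the default weights `alpha`, `beta`, and `gamma` are assumptions made for illustration; the abstract does not state which weights the author used.

```python
from typing import Dict, List

# Sparse vector: term -> weight (an illustrative representation, not the thesis's).
Vector = Dict[str, float]


def centroid(docs: List[Vector]) -> Vector:
    """Return the mean of a list of sparse document vectors."""
    total: Vector = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(docs)
    return {term: weight / n for term, weight in total.items()} if n else {}


def rocchio(
    query: Vector,
    relevant: List[Vector],
    nonrelevant: List[Vector],
    alpha: float = 1.0,   # weight of the original query (assumed default)
    beta: float = 0.75,   # weight of the relevant centroid (assumed default)
    gamma: float = 0.15,  # weight of the non-relevant centroid (assumed default)
) -> Vector:
    """Textbook Rocchio update: q' = alpha*q + beta*C_rel - gamma*C_nonrel."""
    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    updated: Vector = {}
    for term in set(query) | set(rel_c) | set(non_c):
        weight = (
            alpha * query.get(term, 0.0)
            + beta * rel_c.get(term, 0.0)
            - gamma * non_c.get(term, 0.0)
        )
        # Negative term weights are conventionally dropped (clipped to zero).
        if weight > 0:
            updated[term] = weight
    return updated


if __name__ == "__main__":
    q = rocchio(
        {"jaguar": 1.0},
        relevant=[{"jaguar": 1.0, "car": 1.0}, {"car": 1.0}],
        nonrelevant=[{"jaguar": 1.0, "cat": 1.0}],
        alpha=1.0, beta=0.5, gamma=0.5,
    )
    print(sorted(q.items()))
```

Because the update moves the query toward a single relevant centroid, it implicitly assumes the relevant documents form one cluster, which is the assumption that fails in the two situations the abstract describes.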
The thesis makes the following contributions. (1) For queries that contain multiple conditions, and following the idea of the Rocchio relevance feedback algorithm, the set of documents containing information on both conditions is treated as a new intersection class. Within this class, the query starts from the centroid nearest the initial query and moves step by step toward the other centroid, and the desired results are found along the way. The improved Rocchio algorithm effectively addresses the unsatisfactory results returned for multi-condition queries. (2) For queries on polysemous terms, the results returned by the system are often disordered. The thesis designs an algorithm that clusters the results by their attributes, the hierarchical shrinkage algorithm. It first extracts the keywords of the returned results and represents them as a Boolean matrix, then computes the similarity between documents using the number of keywords that documents share as the measure, merges documents hierarchically by conjunction according to that similarity, and extracts the resulting labels once clustering ends. Leaving recall aside, the final result of the algorithm converges on keywords that are highly descriptive of the documents in each cluster, and achieves relatively high accuracy.
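The clustering steps in contribution (2) can be illustrated with a small sketch. This is one possible reading of the abstract, not the author's implementation: each document is a set of keywords (equivalent to one row of a Boolean document-keyword matrix), similarity is taken to be the number of keywords two clusters share, and merging by conjunction is taken to mean keeping only the keywords common to both clusters. The function name `shrink_clusters` and the stopping threshold `min_shared` are hypothetical; the abstract does not say how the algorithm decides to stop.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple

# A cluster is (indices of member documents, keywords shared by all members).
Cluster = Tuple[Set[int], FrozenSet[str]]


def shrink_clusters(
    doc_keywords: List[Set[str]],
    min_shared: int = 1,  # hypothetical stopping threshold, not from the thesis
) -> List[Tuple[List[int], List[str]]]:
    """Greedy agglomerative clustering over Boolean keyword sets.

    Repeatedly merges the pair of clusters with the most shared keywords,
    keeping only the intersection (logical AND) of their keyword sets, until
    no pair shares at least `min_shared` keywords.
    """
    clusters: List[Cluster] = [
        ({i}, frozenset(kw)) for i, kw in enumerate(doc_keywords)
    ]
    while len(clusters) > 1:
        best_pair, best_score = None, 0
        for a, b in combinations(range(len(clusters)), 2):
            score = len(clusters[a][1] & clusters[b][1])
            if score > best_score:
                best_pair, best_score = (a, b), score
        if best_pair is None or best_score < min_shared:
            break
        a, b = best_pair
        merged: Cluster = (
            clusters[a][0] | clusters[b][0],  # union of member documents
            clusters[a][1] & clusters[b][1],  # conjunction of keyword sets
        )
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    # The surviving keyword set of each cluster serves as its label.
    return [(sorted(members), sorted(label)) for members, label in clusters]


if __name__ == "__main__":
    # Results for an ambiguous query such as "apple" (made-up example data).
    docs = [
        {"apple", "fruit", "juice"},
        {"apple", "fruit", "pie"},
        {"apple", "iphone", "phone"},
        {"apple", "phone", "mac"},
    ]
    for members, label in shrink_clusters(docs, min_shared=2):
        print(members, label)
```

Under this reading, each merge can only shrink a cluster's keyword set, so the label that remains is a set of keywords present in every member document. That is consistent with the abstract's remark that the result favors descriptive labels while setting recall aside.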
[Degree-granting institution]: Henan University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
