Research on Positive and Negative Relevance Feedback and Query Expansion Techniques

Published: 2018-06-06 15:58

Topics: information retrieval + relevance feedback; source: Inner Mongolia University, doctoral dissertation, 2012

[Abstract]: Acquiring information plays an important role in people's work, daily life and other activities, and the channels and methods for obtaining it are many and varied. With the rapid development of computer networks, mobile communications and global informatization, obtaining information through the Web and search engines has become a habit in everyday life and work, and is one of the most important ways of acquiring information. Information is widely distributed, diverse in form, openly organized, loosely managed, and quickly updated, changed and transmitted; together these factors make information retrieval harder. Users also place higher and more varied demands on retrieval in terms of results, efficiency and modes of interaction, and these characteristics and demands pose greater challenges for information retrieval. A search engine must be backed by powerful, advanced retrieval technology to meet users' needs well.
Users usually express their information needs imprecisely and unclearly, often with only a few words, and frequently do not obtain satisfactory results. Expanding the query model through feedback is a common and effective strategy for improving retrieval performance, so query expansion and feedback techniques have long been a research focus in information retrieval. Most of the work in this area concentrates on relevance feedback and pseudo-relevance feedback; negative feedback has begun to attract attention only in recent years. However, a model that combines positive and negative feedback within the language modeling framework has not yet appeared at SIGIR. This dissertation takes the positive and negative feedback model as its core and studies its model framework, the automatic identification of positive and negative feedback, the dynamic adjustment of model parameters, and multi-topic feedback. The main results are as follows.
(1) A framework for the positive and negative feedback model: Building on existing work on relevance feedback, pseudo-relevance feedback and negative feedback, a retrieval model framework that combines positive and negative feedback within the language modeling approach is proposed; relevance feedback, pseudo-relevance feedback and negative feedback models are all special cases of it. Positive feedback reinforces and amplifies the query information, while negative feedback suppresses noise in the query and noise contained in the positive feedback, which improves retrieval performance. The model outperforms both the pseudo-relevance feedback model and the relevance feedback model in average precision and in precision over the top 10 documents, and compared with pseudo-relevance feedback it substantially reduces the number of queries that are hurt, improving robustness.
Dynamic adjustment of the model parameters: The positive and negative feedback model is a mixture of three components (the query, the positive feedback and the negative feedback) combined by linear interpolation, and the results of any such mixture retrieval model are fairly sensitive to the interpolation coefficients of its components. Two simple, practical and effective algorithms for adjusting the parameters dynamically are proposed for the model: one computes them from the proportion of non-relevant documents, and the other learns the parameter values from a training set. Both further improve the retrieval performance of the model.
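The abstract does not give the interpolation formula, so the following is only a minimal Python sketch of one plausible reading of a three-component mixture. The function names, the subtraction of the negative component, the clipping and renormalization step, and the `beta_from_ratio` rule with its `beta_max` cap are all assumptions made for illustration, not the dissertation's method.

```python
from collections import Counter


def unigram_model(docs):
    """Maximum-likelihood unigram model over a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def mix_feedback(query_lm, pos_lm, neg_lm, alpha, beta):
    """Interpolate query, positive and negative models (assumed form).

    Assumption: the negative component is subtracted, non-positive weights
    are dropped, and the remainder is renormalized to sum to 1.
    """
    vocab = set(query_lm) | set(pos_lm) | set(neg_lm)
    raw = {
        w: (1.0 - alpha) * query_lm.get(w, 0.0)
        + alpha * pos_lm.get(w, 0.0)
        - beta * neg_lm.get(w, 0.0)
        for w in vocab
    }
    kept = {w: v for w, v in raw.items() if v > 0.0}
    norm = sum(kept.values())
    # Fall back to the original query model if everything was suppressed.
    return {w: v / norm for w, v in kept.items()} if norm else dict(query_lm)


def beta_from_ratio(num_nonrelevant, k, beta_max=0.5):
    """Hypothetical rule: scale the negative weight by the share of
    non-relevant documents among the top k."""
    return beta_max * num_nonrelevant / k if k else 0.0
```

In this sketch, setting `beta` to 0 reduces the mixture to positive-only feedback, and setting both `alpha` and `beta` to 0 returns the original query model, which mirrors the claim that the simpler feedback models are special cases of the combined one.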
(2) Separating relevant from non-relevant documents by clustering: Based on an analysis of how relevant and non-relevant documents are distributed among the top k documents, theoretical analysis and experiments show that density-based clustering identifies isolated non-relevant documents well. With an improved version of the density-based clustering algorithm DBSCAN, non-relevant documents in the top k are found with a precision above 72% and a recall of 32%, and relevant documents in the top k are found with a precision above 54% and a recall above 87%. The top k documents are divided into two sets, a connected set and a set of isolated points, which serve as the positive and negative feedback, respectively, in the positive and negative feedback model; the resulting retrieval performance far exceeds that of pseudo-relevance feedback.
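The split of the top k into a connected set and isolated points can be illustrated with a simplified, DBSCAN-style noise test. This is a sketch only: it is plain density-based noise detection, not the improved DBSCAN variant the dissertation describes, and the cosine similarity over raw term counts, the `min_sim` threshold and the `min_pts` value are assumptions chosen for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def split_top_k(docs, min_sim=0.5, min_pts=2):
    """Split tokenized top-k documents into (connected, isolated) index lists.

    A document is a core point if at least `min_pts` other documents have
    cosine similarity >= `min_sim` to it. Core points and their neighbours
    form the connected set; every other document is treated as isolated.
    """
    vecs = [Counter(d) for d in docs]
    n = len(vecs)
    neighbours = [
        [j for j in range(n) if j != i and cosine(vecs[i], vecs[j]) >= min_sim]
        for i in range(n)
    ]
    connected = set()
    for i, nbrs in enumerate(neighbours):
        if len(nbrs) >= min_pts:
            connected.add(i)
            connected.update(nbrs)
    isolated = [i for i in range(n) if i not in connected]
    return sorted(connected), isolated
```

The connected indices would then feed the positive feedback model and the isolated indices the negative feedback model. Both thresholds would need tuning on real retrieval data.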
(3) Improving the pseudo-relevance feedback model with multiple topic domains: A new model is proposed that uses information from multiple topic domains to improve pseudo-relevance feedback. The reconstructed query is composed of the original query, the top k, and the top s from the multiple topic domains, and it effectively improves the retrieval performance of pseudo-relevance feedback. The method can also be applied to personalized retrieval.
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Doctoral
[Year degree awarded]: 2012
[Classification number]: TP391.3
