Research on a DEA-Based Listwise Learning-to-Rank Method

Published: 2018-04-09 18:38

Topic: information retrieval; Focus: learning to rank; Source: Southwest Jiaotong University, master's thesis, 2014

[Abstract]: The rapid growth of the Internet and of digital devices has produced an overwhelming volume of information, and users urgently need systems that provide efficient, convenient information retrieval. Web search engines have therefore become an important tool for obtaining information. A search engine comprises many subsystems, of which the ranking system is the core: given the query terms a user submits, it quickly locates the most relevant set of documents in a massive collection and returns them in order of relevance, reducing the time users spend searching. Researchers have proposed many ranking algorithms for this purpose, mostly based on content analysis or link analysis, which use the relevance and importance features of documents to assess how well a document matches the user's search intent. These algorithms have greatly improved ranking in information retrieval systems, but two important shortcomings remain: either the query-document features used to build the ranking model are limited, or, when a large number of features is used, tuning the model parameters becomes the main obstacle. Learning to rank, an area at the intersection of machine learning and information retrieval, automatically learns a ranking model from large, manually labeled training sets and applies it to predict rankings for unseen data. Training instances are represented as multidimensional feature vectors that encode a variety of complex information about the relevance between documents and queries. Learning-to-rank methods currently fall into three broad categories: pointwise, pairwise, and listwise. Studies indicate that listwise methods perform best on most public datasets. This thesis focuses on listwise learning to rank and proposes a new ranking method, DEARank, that combines data envelopment analysis (DEA) with boosting. It modifies the classical CCR model to construct two degenerate DEA models, CCR-I and CCR-O, treats the documents to be ranked as decision-making units, and uses the optimal weights of these models to build a set of weak rankers. Each candidate weak ranker reflects a decision-making unit's preference over the features and represents a feature subset drawn from the full feature space. Boosting is then applied to these candidate weak rankers to train a better-performing ranking model. Finally, the empirical results of DEARank on the LETOR datasets (HP2003, HP2004, NP2003, NP2004, TD2003, TD2004, OHSUMED, MQ2007, and MQ2008) are compared with those of twelve classical learning-to-rank algorithms. The experiments show that DEARank performs strongly, providing an important ranking algorithm for web information retrieval systems.
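For orientation, the classical CCR model that the abstract refers to can be written in its standard input-oriented multiplier form. This is the textbook formulation, not the thesis's CCR-I or CCR-O variants, whose exact definitions the abstract does not give. The symbols are assumptions introduced here: $n$ decision-making units indexed by $j$, inputs $x_{ij}$ ($i = 1,\dots,m$), outputs $y_{rj}$ ($r = 1,\dots,s$), and $o$ denoting the unit under evaluation.

```latex
\[
\begin{aligned}
\max_{u,\,v}\quad & \sum_{r=1}^{s} u_r\, y_{ro} \\
\text{s.t.}\quad  & \sum_{i=1}^{m} v_i\, x_{io} = 1, \\
                  & \sum_{r=1}^{s} u_r\, y_{rj} - \sum_{i=1}^{m} v_i\, x_{ij} \le 0,
                    \qquad j = 1,\dots,n, \\
                  & u_r \ge 0,\quad v_i \ge 0 .
\end{aligned}
\]
```

Solving this linear program once per unit gives each unit its own optimal weights $(u^{*}, v^{*})$ and an efficiency score of at most 1. The abstract says that weights of this kind, obtained for each document, define the candidate weak rankers. It does not specify how document features are assigned to inputs and outputs.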
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: O223
