Design and Implementation of an Efficient LTR Research System for Large-Scale Data

Published: 2018-06-12 17:34

Topics: web page ranking; LTR research system. Source: Nanjing University, master's thesis, 2015

[Abstract]: LTR (learning to rank, the use of machine learning methods to rank web pages) plays an increasingly important role in commercial search engines, and the major commercial search engines have gradually adopted it as an important means of ranking search results. At the current stage of web page ranking, the LTR algorithm itself yields relatively little further improvement in search accuracy: the results of the LTR competition held by Yahoo in 2010 show that the most accurate algorithms improved only marginally over the baseline algorithms (GBDT and RankSVM), and a considerable share of that improvement came from how the training data was processed. Meanwhile, as the number of web pages grows rapidly, training sets keep getting larger, and LTR must be able to handle them. In addition, some very important features of the training data, such as user click data, change quickly over time, so trained models need to be updated quickly. Efficiency and the ability to handle larger datasets are therefore the main requirements placed on LTR algorithms today. Furthermore, LTR training uses many features (up to about 700), most of which are noisy and contribute little to the final model; choosing a suitable feature set for training can both improve accuracy and greatly reduce training time, so feature selection is also an important part of LTR research. The purpose of an LTR research system is to quickly select a suitable model that the search engine can use to rank web search results. The original LTR system has three main problems: it lacks support for feature analysis and selection; it cannot handle large-scale datasets; and the training algorithm itself is inefficient. Together these make training and updating LTR models slow, so the system cannot keep up with steadily growing data and the need for rapid updates. This thesis designs and implements a new LTR research system that targets these three problems, with three main improvements. The first is a scalable feature analysis platform that supports large-scale data; it is used to analyze features, to guide the selection of features for the model, and to explain the final results to some extent. The second is an efficient single-machine implementation of the LTR training algorithm that makes full use of the new software and hardware environment to improve training efficiency. The third is a tree-model training platform for large-scale data, consisting of a resource scheduling module that addresses the computing-resource problem and a distributed tree-model training module that supports automatic failure recovery. The final results show that the research system more than doubles the efficiency of the iterative process of selecting features and model parameters, supports large-scale data processing, and improves LTR model training in both efficiency and accuracy.
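As a rough illustration of the kind of feature screening the abstract motivates, the sketch below ranks feature columns by the absolute Pearson correlation between each feature and the relevance label, then keeps the strongest ones. The abstract does not say which analysis method the thesis uses, so the correlation criterion, the function names `pearson` and `select_features`, and the toy data are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no ranking signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, keep):
    """Return indices of the `keep` features most correlated with the labels.

    Hypothetical helper: `rows` is a list of feature vectors and `labels`
    holds the relevance grade of each row.
    """
    num_features = len(rows[0])
    scores = []
    for j in range(num_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Strongest correlation first; ties are broken by feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:keep]]


# Toy data (made up): feature 0 tracks relevance, feature 1 is weakly
# related, and feature 2 is constant.
rows = [
    [0.9, 0.1, 5.0],
    [0.7, 0.8, 5.0],
    [0.4, 0.2, 5.0],
    [0.1, 0.9, 5.0],
]
labels = [3, 2, 1, 0]
selected = select_features(rows, labels, keep=2)
```

A per-feature score like this is cheap to compute column by column, which is one reason such screening scales to large training sets. It ignores interactions between features, so it can only serve as a first filter before model-based analysis.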
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3
