Design and Implementation of an Efficient LTR Research System for Large-Scale Data

Published: 2018-06-12 17:34

Topic: web page ranking + LTR research system; source: Nanjing University, 2015 master's thesis

[Abstract]: Learning to rank (LTR), the use of machine learning to rank web pages, plays an increasingly important role in commercial search engines, and the major commercial search engines have gradually adopted it as a key means of ranking search results. At the current stage of web ranking, improvements to LTR algorithms themselves yield relatively small gains in search accuracy. The results of the LTR competition held by Yahoo in 2010 show that the most accurate algorithm improved only marginally on the baseline algorithms (GBDT and RankSVM), and that a considerable share of that improvement came from how the training data was processed. Meanwhile, as the number of web pages grows rapidly, training sets keep getting larger, and LTR must be able to handle them. In addition, some very important features of the training data, such as user click data, change quickly over time, so trained models must be updated quickly. Efficiency and the ability to handle larger data are therefore the main requirements placed on LTR algorithms today. Furthermore, LTR training uses many features (up to around 700), most of which are noisy and contribute little to the final model. Selecting a suitable feature set for training can both improve accuracy and greatly reduce training time, so feature selection is also an important part of LTR research. The purpose of an LTR research system is to quickly select a suitable model for the search engine to use in ranking web search results. The original LTR system has three main problems: it lacks support for feature analysis and selection, it cannot handle large-scale data sets, and its training algorithm is itself inefficient.
These problems make training and updating LTR models inefficient, so the system cannot keep up with growing data and the demand for fast updates. This thesis designs and implements a new LTR research system that addresses the three problems. It comprises three main improvements. The first is a scalable feature analysis platform that supports large-scale data; it is used to analyze features, to inform the choice of features for the model, and to explain the final results to some extent. The second is an efficient single-machine implementation of the LTR training algorithm that takes full advantage of newer software and hardware environments to improve training efficiency. The third is a training platform for tree models on large-scale data, built to process data in bulk; it consists of a resource scheduling module that addresses computing resources and a distributed tree-model training module that supports automatic failure recovery. The final results show that the system more than doubles the efficiency of the iterative process of selecting features and model parameters, supports large-scale data processing, and improves LTR model training in both efficiency and accuracy.
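The abstract does not say how the feature analysis platform scores features. As a rough illustration only, the sketch below shows one common filter-style approach: rank each feature by the absolute Pearson correlation between its values and the relevance labels, then keep the top k. The function names, the toy data, and the choice of Pearson correlation are all assumptions made for this example, not the thesis's method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no ranking signal
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, k):
    """Return indices of the k features most correlated (in absolute value) with the labels."""
    num_features = len(rows[0])
    scores = []
    for j in range(num_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data (made up): feature 0 tracks the label, feature 1 is constant,
# feature 2 is mostly noise.
rows = [
    [0.1, 1.0, 0.5],
    [0.4, 1.0, 0.1],
    [0.6, 1.0, 0.9],
    [0.9, 1.0, 0.3],
]
labels = [0, 1, 2, 3]
selected = select_features(rows, labels, k=2)
print(selected)
```

A per-feature score like this is cheap and can be computed independently for each feature, which is one reason filter methods are often used when there are hundreds of features; it does not capture interactions between features, so a tree model's own importance measures could rank features differently.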
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.3

[References]