Similarity Learning for Heterogeneous Data and Its Application to Web Search

Published: 2018-09-07 07:53

[Abstract]: This thesis studies the problem of learning similarity between heterogeneous data and the application of similarity learning to web search. Similarity learning plays an important role in many applications, including web search, recommender systems, image annotation, and machine translation. In essence, the task in each of these applications reduces to learning and using a similarity function to match two types of heterogeneous instances: queries and documents in web search, users and items in recommender systems, keywords and images in image annotation, and sentences in two languages in machine translation. In web search in particular, the search engine is the medium that produces query-document matches. With the rapid growth of information on the web, people depend more and more on search engines. The task of a search engine is to retrieve relevant documents for the queries issued by different users and to rank those documents by relevance. Queries and documents are two heterogeneous types of instances, and their relevance is determined by the similarity between them. The quality of the similarity function therefore directly determines the performance of the search engine.
This thesis defines the similarity function as an inner product in a Hilbert space. Specifically, a mapping function is defined for each of the two types of heterogeneous instances. The mapping functions map the heterogeneous instances into the same Hilbert space, and the inner product of their images is defined as the similarity function. Under this definition, the thesis considers two ways of learning the similarity of heterogeneous data: (1) first learn the mapping functions, then compute the inner product of the images to obtain the similarity function; (2) learn the similarity function directly. For each approach, the thesis addresses three questions: (1) how to combine information from different sources; in web search, for example, the content of queries and documents as well as click-through data can all be used to learn the similarity function; (2) how to improve the efficiency and scalability of the learning algorithm so that it can handle massive data; (3) how to analyze the generalization ability of the learning algorithm.
The thesis first considers learning the mappings and then defining the similarity function as the inner product of the images. In particular, it considers learning two linear mappings, so that the resulting similarity function is a bilinear form. For this approach, two hypothesis spaces are defined for the linear mappings. In the first, the columns of each linear mapping are required to be orthonormal. Under this assumption, a multi-view learning method is proposed that can effectively use information from different sources. Then, to improve the efficiency and scalability of learning, a regularized method is given. Specifically, the l_1 norm and l_2 norm of the row vectors of the linear mappings are constrained. This assumption guarantees the sparsity of the solution and makes the algorithm easy to parallelize. Finally, the generalization ability of these similarity learning methods is studied systematically.
Next, the thesis considers learning the similarity function by directly defining its hypothesis space. In particular, it draws on kernel methods from machine learning and proposes a kernel-based similarity learning approach. Specifically, it defines a special positive semi-definite kernel, the S-kernel. An S-kernel generates a hypothesis space of similarity functions. The kernel method guarantees the optimality of the solution as well as its generalization ability. To improve the efficiency of the learning algorithm, an online approximation of the algorithm is proposed.
Heterogeneous data similarity learning is then applied to web search, and the proposed learning methods are shown to address the term mismatch problem in web search. Experiments were conducted on real, large-scale enterprise search data and web search data. The results show that the proposed methods effectively overcome the term mismatch problem and significantly improve on traditional methods in relevance ranking and similar-query discovery.
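As a rough illustration of the first approach in the abstract (two linear mappings into a shared space, with the inner product of the images as the similarity), the bilinear form can be sketched as f(q, d) = <L_q^T q, L_d^T d> = q^T L_q L_d^T d. The sketch below is an assumption-laden toy, not the thesis's method: the vocabularies, the two-dimensional shared space, and the hand-picked matrices `L_Q` and `L_D` are all hypothetical, and nothing here is learned from data. It only shows how a query and a document that share no terms can still receive a nonzero similarity.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per input term, one column per shared dimension


def project(x: Sequence[float], mapping: Matrix) -> Vector:
    """Map a term vector x into the shared space, i.e. compute mapping^T x."""
    k = len(mapping[0])
    return [sum(x[i] * mapping[i][j] for i in range(len(x))) for j in range(k)]


def similarity(q: Sequence[float], d: Sequence[float], l_q: Matrix, l_d: Matrix) -> float:
    """Bilinear similarity: inner product of the two projected images."""
    q_img = project(q, l_q)
    d_img = project(d, l_d)
    return sum(a * b for a, b in zip(q_img, d_img))


def rank(q: Sequence[float], docs: Sequence[Sequence[float]], l_q: Matrix, l_d: Matrix) -> List[int]:
    """Return document indices ordered from most to least similar to q."""
    scores = [similarity(q, d, l_q, l_d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy setup (assumed for illustration only).
# Query vocabulary: ["ny", "hotel"]; document vocabulary: ["new", "york", "hotel", "paris"].
# Shared space: dimension 0 ~ "New York", dimension 1 ~ "hotel".
L_Q: Matrix = [[1.0, 0.0],   # "ny"
               [0.0, 1.0]]   # "hotel"
L_D: Matrix = [[0.5, 0.0],   # "new"
               [0.5, 0.0],   # "york"
               [0.0, 1.0],   # "hotel"
               [0.0, 0.0]]   # "paris"

QUERY = [1, 0]        # the query "ny"
DOC_A = [1, 1, 0, 0]  # a document containing "new york" (no literal term overlap with "ny")
DOC_B = [0, 0, 0, 1]  # a document containing "paris"
```

Under these assumed mappings, `similarity(QUERY, DOC_A, L_Q, L_D)` is positive even though the query and document share no surface terms, while `DOC_B` scores zero, so `rank` places `DOC_A` first. In the setting the abstract describes, the mappings would instead be learned from sources such as query and document content and click-through data.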
[Degree-granting institution]: Peking University
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP391.3

[Similar literature]