Research on Quantization-Based Approximate Nearest Neighbor Search

Published: 2018-01-01 06:41

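Keywords: quantization-based approximate nearest neighbor search　Source: University of Science and Technology of China, 2017 doctoral dissertation　Document type: dissertation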

[Abstract]: Nearest neighbor search is a fundamental problem in machine learning, computer vision, and information retrieval. In large-scale, high-dimensional settings, however, finding the exact nearest neighbor of a query point requires a large amount of computation and storage. Approximate nearest neighbor (ANN) search has attracted wide attention because it needs less storage and answers queries more efficiently, and how to perform it quickly, efficiently, and accurately remains an active and difficult research topic. In general, ANN search algorithms speed up search in two ways while preserving accuracy as far as possible. The first uses specialized data structures to reduce the number of comparisons between the query point and the data points. The second uses compact codes to accelerate the computation of distances between the query point and the data points, for example by mapping data points to compact codes with hashing or quantization algorithms. This thesis takes the second approach, quantization-based ANN search, and studies how to obtain higher-quality compact codes that improve both search accuracy and search efficiency. The main contents and contributions are as follows:

1. For unsupervised ANN search, the thesis proposes a composite quantization method. The main idea is to approximate each data point by a reconstruction that is the sum of several sub-centers, each drawn from a different sub-dictionary. The data point is then represented by the indices of those sub-centers in their respective sub-dictionaries. A near-orthogonality constraint is also introduced so that the distance between the query point and the reconstruction can be replaced by the sum of the distances between the query point and the individual sub-centers, which accelerates distance computation. Experimental comparisons with existing quantization methods show that near-orthogonal composite quantization achieves higher search accuracy.

2. The thesis proposes a sparse composite quantization algorithm that reduces the time needed to build the lookup table in composite quantization. ANN search over large-scale data is usually combined with an inverted index for further speedup. When composite quantization is used to rank the data points returned by the inverted index, the time spent building the lookup table is no longer negligible. To address this, sparse composite quantization adds a sparsity constraint so that every sub-center in the dictionaries is a sparse vector. Building the lookup table requires the Euclidean distance between the query point and each sub-center, and because the sub-centers are sparse, these distances can be computed faster. ANN search experiments on large-scale datasets show that sparse composite quantization searches faster than composite quantization.

3. The thesis applies quantization-based ANN search to cross-modal nearest neighbor search, in which the query point and the data points come from different modalities, for example using an image query to find similar text items or a text query to find similar images. The proposed algorithm assumes only that each image is paired one-to-one with a piece of text and does not require the class labels of the images or the text. It first maps each pair of items from the two modalities into a common space, then approximates the mapped data of each modality with composite quantization while making the approximate representations of a paired image and text as similar as possible. Extensive experimental comparisons show that the proposed algorithm achieves higher search accuracy in cross-modal ANN search.

4. For supervised ANN search, the thesis proposes a new quantization method. Unlike unsupervised ANN search, where quantization is applied directly to the database points, the proposed algorithm first applies a linear transformation to the data points and then performs composite quantization on the transformed points. The optimization objective requires both that the quantized approximation accurately represent the transformed data points and that the transformed data points be class-separable: points of the same class lie close together after the transformation, and points of different classes lie far apart. Experimental comparisons with existing supervised ANN search algorithms show that the proposed algorithm achieves higher search accuracy.

In summary, the thesis proposes four novel algorithms across three areas (unsupervised, cross-modal, and supervised ANN search) to improve the accuracy and efficiency of ANN search. Extensive experimental results show that the proposed methods return better search results than existing methods.
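The table-based distance computation described in contribution 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the toy dictionaries, and the query are all hypothetical, and the sub-dictionaries are made exactly orthogonal (disjoint non-zero coordinates) as a simplification of the near-orthogonality constraint. Under that assumption, the sum of per-dictionary table entries differs from the true squared distance to the reconstruction only by (M - 1)·||q||², a constant for a fixed query q and M sub-dictionaries, so the ranking is unchanged.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruct(codebooks, code):
    """Sum the selected sub-center from each sub-dictionary."""
    out = [0.0] * len(codebooks[0][0])
    for book, idx in zip(codebooks, code):
        for j, value in enumerate(book[idx]):
            out[j] += value
    return out


def build_lookup_table(query, codebooks):
    """table[m][k] = squared distance from the query to sub-center k of dictionary m."""
    return [[sq_dist(query, center) for center in book] for book in codebooks]


def table_distance(table, code):
    """Approximate distance: one table lookup per sub-dictionary, then a sum."""
    return sum(row[idx] for row, idx in zip(table, code))


def search(query, codebooks, codes, top_k=1):
    """Rank encoded database points by their table-based distance to the query."""
    table = build_lookup_table(query, codebooks)
    order = sorted(range(len(codes)), key=lambda i: table_distance(table, codes[i]))
    return order[:top_k]


# Hypothetical toy data: M = 2 sub-dictionaries with K = 2 sub-centers each.
# The two dictionaries use disjoint coordinates, so they are exactly orthogonal.
CODEBOOKS = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 3.0]],
]
# Each database point is stored only as its tuple of sub-center indices.
CODES = [(0, 0), (0, 1), (1, 0), (1, 1)]
QUERY = [0.9, 0.1, 0.2, 2.8]

NEAREST = search(QUERY, CODEBOOKS, CODES, top_k=2)
```

The lookup table is built once per query at a cost proportional to the total number of sub-centers. After that, each database point costs only M lookups and additions, regardless of the vector dimension.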

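[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctorate
[Year degree awarded]: 2017
[Classification number]: TP391.3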
