高维数据流上的K近邻问题研究

发布时间：2018-01-01 16:42

本文关键词：高维数据流上的K近邻问题研究　出处：《山东大学》2016年博士论文　论文类型：学位论文

【摘要】：随着计算机和网络技术的发展,数据收集变得越来越容易,数据的规模和复杂性越来越高,数据的维度(属性)也越来越高。通常情况下各种类型的数据,如Web文档、图片、基因表达数据及多媒体数据等,可以通过一些特征抽取的方法被表示成高维的向量(或者高维空间中的点)。数据库、信息检索和数据挖掘等领域的一个常见操作,就是在一个对象集中找到某个查询(或某个查询集中所有查询)的最相似的K个对象,如果用一个距离度量(如欧式距离)来衡量相似度,这个操作就可以转化为求高维空间K近邻(或K近邻连接)的问题,即在一个对象集中查找离查询点(或一个查询集中的每个点)距离最近的K个对象。目前高维空间中的K近邻和K近邻连接算法主要针对静态的数据集,而越来越多的应用需要在数据流上连续不断地获得查询结果,这就对高维K近邻(及连接)问题提出了新的挑战。本文对高维数据流上的K近邻相关问题进行了研究。首先,我们研究了高维数据流上的连续K近邻连接问题。由于被查询点集在数据流上会不断发生变化,因此查询集的K近邻连接结果需要实时更新。本文提出采用增量更新策略：先求解更新点的反向K近邻结果,即查询集中K近邻结果会受更新点影响的那些查询,然后只更新这一部分查询的K近邻结果,以避免重复计算。本文提出了一种新的索引结——HDR-树,来提高求解更新点的反向K近邻结果的效率。HDR-树主要利用主成分分析(PCA)降维和聚类划分的技术来提高查询效率。进一步地,由于在大多数的应用下,求得近似K近邻连接查询结果也是可以接受的,本文研究了在高维数据流上近似地得到K近邻连接结果的方法,以进一步减少处理时间。本文提出了两种近似K近邻连接的索引结构和算法：HDR*-树和LSH-M。HDR*-树结合了随机投影降维和聚类技术,由于随机投影降维后的数据能够近似保留原数据点之间的距离关系,HDR*-树在求解受更新点影响的查询点时具有更高的效率。LSH-M则利用了局部敏感哈希的思想来构建索引,并提供有准确率保障的快速检索。本文分析了两种方法的查询准确率,并用实验结果验证了它们的效率。最后,为了进一步提高查询效率和可扩展性,本文研究了高维数据流上的分布式K近邻(及连接)查询的问题,并设计了分布式索引和查询算法来提高查询效率。本文提出了一个新的索引结构,叫作动态环索引(Dynamic Bounded Rings Index),它首先找到一个参照点集,把数据流上的点分配到最近的参照点,然后对每个参照点上的数据子集按照到参照点的距离进一步划分到更细粒度的有界环型内,环的边界可以动态维护以适应数据流上点的不断更新,并且由于每个动态环可以分配到不同的节点上处理,它可以很容易地应用到分布式环境中。在动态环索引的基础上,本文设计了分布式高维K近邻查询算法,该算法的主要优点是在只进行两次迭代的情况下就能得到准确的K近邻结果,减少了在分布式环境中进行K近邻查询的通信代价和计算代价。本文将提出的索引和算法在开源的分布式流处理平台Apache Storm上进行了实现,在真实数据集和模拟数据集上的实验结果都显示我们的算法对比已有的基准方法在查询效率上有很大提升。
[Abstract]:With the development of computer and network technology, data collection becomes more and more easy, the size of the data and more complex data, the dimensions (attributes) are increasingly high. Usually, various types of data, such as Web documents, pictures, gene expression data and multimedia data, through the method of characteristics the extraction is represented as a high-dimensional vector (or points in high dimensional space). A common operation database, information retrieval and data mining, is a focus to find a query (or a set of queries in all queries) is most similar to the K object, if a distance metric (e.g. Euclidean distance) to measure the similarity, this operation can be converted to a high dimensional space (K or K nearest neighbor neighbor connection) problem, namely in a focused search from the query point (each point or a query set distance) In the K object. At present in the high dimensional space of K nearest neighbor and K nearest neighbor joining algorithm is mainly based on the static data sets, and more and more applications in data stream continuously get the query results, the high dimensional K nearest neighbor (and connected) has brought new challenges. This paper study on related problems of K nearest neighbor flow on high dimensional data. Firstly, we study the continuous K nearest neighbor high dimensional data stream connection problem. Because the query point sets in data streams will change constantly, so the query K nearest neighbor set connection results need to be updated in real time. This paper proposed the incremental update strategy: the first is to update the reverse K nearest neighbor points, those queries that query results will be updated by the K nearest neighbors of the K nearest neighbor results and then update only this part of the query, to avoid repeated calculation. This paper proposes a new index node - H DR- tree, to improve the solution of reverse K nearest neighbor results point to the updating efficiency of.HDR- tree by using principal component analysis (PCA) dimension reduction clustering technology to improve query efficiency. Further, because in most applications, the approximate K nearest neighbor query results can also be connected to accept the method were studied in this paper. High dimensional data stream is approximately K neighbor connection results, to further reduce the processing time. This paper presents two kinds of index structure and algorithm of approximate K nearest neighbor connect: HDR*- tree and LSH-M.HDR*- tree with random projection dimensionality reduction and clustering technology, the random projection dimensionality reduction data can keep approximate distance between the original data points, the HDR*- tree by the update effect in solving the query point has higher efficiency of.LSH-M by using the locality sensitive hashing method to construct the index, and provide the accurate rate of insurance Fast retrieval system. This paper analyzes two methods of query accuracy, and verify their efficiency with the experimental results. Finally, in order to further improve the query efficiency and scalability, this paper studies the distributed K nearest neighbor on the high dimensional data stream (and connected) query, and design a distributed index and the query algorithm to improve the query efficiency. This paper proposes a new index structure, called dynamic ring index (Dynamic Bounded Rings Index), it is first to find a reference point set, the data flow on the points assigned to the nearest reference point, then for each subset of data points on the reference according to the reference point the distance is further divided into more fine-grained bounded ring, ring boundary can be maintained dynamically in order to adapt to the data flow on the update, and because each dynamic ring can be assigned to different nodes, it can be Easily applied to distributed environment. Based on dynamic ring index, this paper designs a distributed high dimensional K nearest neighbor query algorithm, the main advantage of this algorithm is in only two iterations can be obtained under the condition of K nearest neighbor accurate result, reduces the communication cost and computation cost for nearest neighbor queries in distributed K in the environment. This paper will put forward the index and algorithm in open distributed stream processing platform Apache Storm is achieved in real and synthetic data sets. The experimental results show our algorithm compared with the existing reference methods have greatly improved the efficiency of query processing.

【学位授予单位】：山东大学
【学位级别】：博士
【学位授予年份】：2016
【分类号】：TP311.13

【相似文献】