M2LSH:基于LSH的高维数据近似最近邻查找算法
发布时间:2018-12-07 19:32
【摘要】:在许多应用中,LSH(Locality Sensitive Hashing)以及各种变体,是解决近似最近邻问题的有效算法之一.虽然这些算法能够很好地处理分布比较均匀的高维数据,但从设计方案来看,都没有针对数据分布不均匀的情况做相应的优化.针对这一问题,本文提出了一种新的基于LSH的解决方案(M2LSH,2 Layers Merging LSH),对于数据分布不均匀的情况依然能得到一个比较好的查询效果.首先,将数据存放到具有计数功能的组合哈希向量表示的哈希桶中,然后通过二次哈希将这些桶号投影到一维空间,在此空间根据各个桶中存放的数据个数合并相邻哈希桶,使得新哈希桶中的数据量能够大致均衡.查询时仅访问有限个哈希桶,就能找到较优结果.本文给出了详细的理论分析,并通过实验验证了M2LSH的性能,不仅能减少访问时间,也可提高结果的正确率.
[Abstract]:In many applications, LSH (Locality Sensitive Hashing) and various variants are one of the effective algorithms to solve the approximate nearest neighbor problem. Although these algorithms can deal with the high dimensional data with uniform distribution well, from the point of view of the design scheme, there is no corresponding optimization for the non-uniform distribution of the data. In order to solve this problem, a new solution based on LSH (M2LSH2 Layers Merging LSH),) is proposed in this paper. First, the data is stored in the hash bucket represented by the combined hash vector with counting function, and then these bucket numbers are projected to one dimensional space by the quadratic hash, where the adjacent hash buckets are merged according to the number of data stored in each bucket. So that the new hash bucket in the amount of data can be roughly balanced. The query can only access a limited number of hash buckets to find a better result. In this paper, a detailed theoretical analysis is given, and the performance of M2LSH is verified by experiments, which can not only reduce the access time, but also improve the accuracy of the results.
【作者单位】: 宁波大学信息科学与工程学院;
【基金】:国家自然科学基金(No.61472194,No.61572266) 浙江省自然科学基金(No.LY13F020040,No.LY16F020003) 宁波市自然科学基金(No.2014A610023,No.2015A610119)
【分类号】:TP301.6
,
本文编号:2367724
[Abstract]:In many applications, LSH (Locality Sensitive Hashing) and various variants are one of the effective algorithms to solve the approximate nearest neighbor problem. Although these algorithms can deal with the high dimensional data with uniform distribution well, from the point of view of the design scheme, there is no corresponding optimization for the non-uniform distribution of the data. In order to solve this problem, a new solution based on LSH (M2LSH2 Layers Merging LSH),) is proposed in this paper. First, the data is stored in the hash bucket represented by the combined hash vector with counting function, and then these bucket numbers are projected to one dimensional space by the quadratic hash, where the adjacent hash buckets are merged according to the number of data stored in each bucket. So that the new hash bucket in the amount of data can be roughly balanced. The query can only access a limited number of hash buckets to find a better result. In this paper, a detailed theoretical analysis is given, and the performance of M2LSH is verified by experiments, which can not only reduce the access time, but also improve the accuracy of the results.
【作者单位】: 宁波大学信息科学与工程学院;
【基金】:国家自然科学基金(No.61472194,No.61572266) 浙江省自然科学基金(No.LY13F020040,No.LY16F020003) 宁波市自然科学基金(No.2014A610023,No.2015A610119)
【分类号】:TP301.6
,
本文编号:2367724
本文链接:https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2367724.html