定向多探头随机超平面局部敏感哈希
发布时间:2018-02-13 10:45
本文关键词: 局部敏感哈希 多探头 余弦相似搜索 出处:《华中师范大学》2017年硕士论文 论文类型:学位论文
【摘要】:随着人类步入大数据时代,人们的衣食住行都离不开信息与数据。相似性搜索是大数据研究的一个重要方向。数据的分析与处理往往离不开对高维数据的匹配与查找。针对于高维数据的大规模相似性搜索,局部敏感哈希被认为是一个行之有效的方法且同时被广泛的运用到各个领域中。在最近的几年,许多有关局部敏感哈希的改进算法相继被提出。在使用局部敏感哈希时,为了提高搜索的质量,往往需要消耗大量的空间来存储哈希表。为了克服这个缺陷,相关的研究者提出了基于多探头的局部敏感哈希。此方法在很大程度上提高了哈希表的利用率,从而节省了存储空间。对于多探头局部敏感哈希,现在最常用的两类探寻方式分别为步进式多探头局部敏感哈希与定向多探头局部敏感哈希。相比于步进式多探头局部敏感哈希,定向多探头局部敏感哈希搜索速度更快,所需探头数量更少。然而如今定向多探头探寻方式是基于E2LSH。这意味着它只能适用于欧式距离。对于余弦相似系数,步进式多探头局部敏感哈希是唯一可执行的多探头探寻方式。提出一种基于定向多探头探寻方式的局部敏感哈希用以弥补余弦相似系数下定向探寻方式的空白。同时给出一套完整的数学理论证明作为算法的理论依据。为了说明该算法的优越性,采用了多个公开数据集分别对定向多探头随机超平面局部敏感哈希与步进式多探头随机超平面局部敏感哈希进行了实验。根据实验结果,前者相比于后者在达到相同的召回率时需消耗更少的探头与搜索时间。算法相比于原始局部敏感哈希有更高的空间利用率。另外定向探寻方式仅仅改变了探寻顺序并没有改变存储空间中哈希表的数据结构。这意味着该算法与现有数据结构并未发生冲突,可以直接被使用于现有的局部敏感哈希算法中。从而在减少了空间消耗的同时也保证了搜索的速度。
[Abstract]:As the human race entered the big data era, Similarity search is an important research direction of big data. Data analysis and processing are often inseparable from the matching and searching of high-dimensional data. Scale similarity search, Locally sensitive hashing is considered to be an effective method and is widely used in various fields. In recent years, many improved local sensitive hashing algorithms have been proposed one after another. In order to improve the quality of the search, it is often necessary to consume a large amount of space to store the hash table. In this paper, the local sensitive hashing based on multi-probe is proposed. This method can greatly improve the utilization rate of hash table and save storage space. The two most commonly used search methods are stepwise multi-probe local sensitive hash and directional multi-probe local sensitive hash respectively. Compared with step multi-probe local sensitive hash, directional multi-probe local sensitive hash search is faster than step multi-probe local sensitive hash. The number of probes required is even smaller. However, today's directional multiprobe search is based on E2LSH. this means that it can only be applied to Euclidean distance. For cosine similarity coefficients, Stepwise multi-probe local sensitive hashing is the only executable multi-probe searching method. A local sensitive hash based on directional multi-probe search is proposed to make up for the blank of orientation search under cosine similarity coefficient. At the same time, a complete set of mathematical theory proof is given as the theoretical basis of the algorithm. Several open datasets are used to test the random hyperplane local sensitive hashes with directional multiprobe and the stepwise multiprobe random hyperplane local sensitive hashes respectively. According to the experimental results, Compared with the latter, the former requires less probe and search time to reach the same recall rate. The algorithm has higher space utilization than the original locally sensitive hash. In addition, the directional search only changes the searching sequence. The order does not change the data structure of the hash table in the storage space. This means that the algorithm does not conflict with the existing data structure. It can be directly used in the existing local sensitive hash algorithms, thus reducing the space consumption and ensuring the speed of search.
【学位授予单位】:华中师范大学
【学位级别】:硕士
【学位授予年份】:2017
【分类号】:TP391.3;TP311.13
【参考文献】
相关期刊论文 前7条
1 钟川;陈军;;基于精确欧氏局部敏感哈希的改进协同过滤推荐算法[J];计算机工程;2017年02期
2 原默晗;唐晋韬;王挺;;一种高效的分布式相似短文本聚类算法[J];计算机与数字工程;2016年05期
3 王忠伟;陈叶芳;钱江波;陈华辉;;基于LSH的高维大数据k近邻搜索算法[J];电子学报;2016年04期
4 郑奇斌;刁兴春;曹建军;周星;许永平;;结合局部敏感哈希的k近邻数据填补算法[J];计算机应用;2016年02期
5 韩军辉;卢暾;丁向华;顾宁;;基于局部敏感哈希技术的能耗社区实时推荐系统[J];小型微型计算机系统;2015年11期
6 戴上平;冯鹏;刘盛英杰;舒红;;基于余弦距离的局部敏感哈希的KNN算法在中文文本上的快速分类[J];计算机工程与科学;2015年10期
7 曹玉东;刘艳洋;贾旭;王冬霞;;基于改进的局部敏感哈希算法实现图像型垃圾邮件过滤[J];计算机应用研究;2016年06期
,本文编号:1507979
本文链接:https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1507979.html