基于倒排索引的集合T覆盖查询算法研究
[Abstract]:T coverage is the basis of set similarity query. In recommendation system, document approximate repeat detection and deletion, automatic completion of web search, entity parsing, document clustering, plagiarism detection, data cleaning, social network mining, Gene sequence alignment has important applications in many fields. The efficiency of T-coverage query under the traditional index structure is low, and the T-coverage calculation based on bitmap index can be represented by simple and fast logic bit operation such as AND,OR, so the computation efficiency is high. The T-overlay query based on bitmap index usually uses compressed bitmap, but because of the complexity of update operation, compressed bitmap is not suitable for the occasion of frequent data update. In order to solve this problem, this paper first designs an efficient segmented bitmap index structure, SBII (Segmented Bitmap Inverted Indexes), SBII, which not only takes into account the storage space overhead of bitmaps, but also makes full use of the characteristics of high computing speed of bitmaps. It also avoids the problem of updating compressed bitmap. Secondly, the T-coverage query algorithm SBII_Looped. is implemented on the basis of SBII. Finally, the experiment on the data set shows that compared with the traditional T-overlay query algorithm, the efficiency of SBII_Looped is improved obviously. The traditional T-coverage query algorithm based on CPU platform has low efficiency and throughput. In order to solve this problem, this paper studies the parallel T-coverage query algorithm based on GPU. Firstly, an efficient GPU segmented inverted index structure (GSII (GPU Segmentation Inverted Indexes), GSII) is designed, which takes full account of the characteristics of GPU shared memory. The inverted index is hash segmented so that every segment can be accessed in shared memory. Secondly, on the basis of GSII, an efficient grouping parallel T-overlay query algorithm (GSPS,) is implemented, which can group queries and support both intra-query parallelism and inter-query parallelism, so this algorithm has high efficiency. Finally, experiments on real data sets show that the performance of this algorithm is improved by about 30% compared with the best GPU parallel T-overlay query algorithm.
【学位授予单位】:昆明理工大学
【学位级别】:硕士
【学位授予年份】:2017
【分类号】:TP311.13;TP391.3
【参考文献】
中国期刊全文数据库 前4条
1 王梅;杨思箫;乐嘉锦;;列存储数据库中压缩位图索引技术[J];计算机工程;2012年18期
2 贾连印;奚建清;李孟娟;游进国;刘勇;苗德成;;Dtrie-allpair:高效的集合T-覆盖连接算法[J];华南理工大学学报(自然科学版);2012年06期
3 潘磊;雷钰丽;王崇骏;谢俊元;;基于权重的Jaccard相似度度量的实体识别方法[J];北京交通大学学报;2009年06期
4 宋韶旭;李春平;;基于非对称相似度的文本聚类方法[J];清华大学学报(自然科学版);2006年07期
中国博士学位论文全文数据库 前2条
1 张帆;搜索引擎中索引表求交和提前停止技术优化研究[D];南开大学;2012年
2 贾连印;内存数据库中集合相似度及集合包含问题的研究[D];华南理工大学;2012年
中国硕士学位论文全文数据库 前7条
1 高鹏;推荐系统中信息相似度的研究及其应用[D];上海交通大学;2013年
2 吴斐;基于N-gram的程序代码抄袭检测方法研究[D];西南大学;2012年
3 闫丹;基于基因序列比对的立体匹配算法[D];哈尔滨理工大学;2012年
4 郑寰;数据备份中基于相似性的重复数据删除的研究[D];华中科技大学;2012年
5 卢克;基于Wikipedia的社会网络挖掘[D];哈尔滨工业大学;2009年
6 刘华;Web信息集成中数据清洗的研究[D];武汉理工大学;2007年
7 张长利;网页相似性算法的研究与实现[D];吉林大学;2005年
,本文编号:2434317
本文链接:https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2434317.html