基于多核的细粒度并行的集合相似连接

发布时间：2018-05-05 08:47

本文选题：相似连接 + 并行　；参考：《计算机学报》2017年10期

【摘要】：相似连接是指在给定的两个数据集中,根据给定的相似性度量函数来计算数据之间的相似度,并找出所有相似度不小于给定阈值的数据对的操作.相似连接作为一个基本的操作,被广泛地应用于各种领域.随着网络和移动应用等信息技术的不断发展,数据呈现爆炸式增长,海量数据的分析需要强大的计算能力.为了满足不断增长的计算需求,提高计算效率和计算性能,计算机的体系结构也不断升级,出现了多核多处理器架构.为了提高相似连接的效率和计算资源的利用率,文中提出了基于多核的并行相似连接方法.相似连接操作与传统关系数据库中结构化数据之间的连接操作不同,相似连接处理的是异构数据,每一条数据处理的代价与其结构有关.为了实现多核处理器之间的任务均衡,文中提出了适合相似连接操作特点的数据长度均衡的数据划分方法.根据相似连接操作遵循Filter-Refine两阶段框架的特点,结合现代计算机体系结构的多核特性,提出了基于共享索引的任务分解方法和基于独立索引的任务分解方法.文中利用提出的数据划分方法和任务分解方法,实现了基于多核的并行化相似连接算法,包括自连接和R-S连接.文中对两种不同的实现方式的时间代价进行了分析,其中包括索引更新、索引扫描以及集合交运算的代价,从理论分析的角度证明了数据长度均衡划分和独立索引的实现方式在执行效率上具有优势.文中实验部分利用不同的数据集在多核多处理器平台上对并行相似连接的不同实现方式的执行效率和可扩展性进行了验证.实验结果证明,文中提出的数据长度均衡的数据划分方法以及基于独立索引的任务分解方法可以有效地提高并行相似连接的执行效率,在16核平台上使用16个线程在DBLP数据集上执行并行的相似自连接以及在CiteSeer和DBLP两个数据集上执行并行的R-S连接都可以在1秒内完成.
[Abstract]:Similarity join is an operation that calculates the similarity between data according to the given similarity measure function in two given data sets and finds out all the data pairs whose similarity is not less than a given threshold. As a basic operation, similar join is widely used in various fields. With the development of information technology, such as network and mobile application, the data is increasing explosively. The analysis of massive data needs powerful computing power. In order to meet the increasing demand of computing and improve the computing efficiency and performance, the architecture of computer has been continuously upgraded, and a multi-core and multi-processor architecture has emerged. In order to improve the efficiency of similar join and the utilization of computing resources, a parallel similar join method based on multi-core is proposed in this paper. The similar join operation is different from the connection operation between the structured data in the traditional relational database. The similar join deals with the heterogeneous data, and the cost of each data processing is related to its structure. In order to realize the task equalization between multi-core processors, a data partition method of data length equalization suitable for the characteristics of similar connection operation is proposed in this paper. According to the characteristics of the similar join operation following the Filter-Refine two-stage framework and the multi-core characteristics of modern computer architecture, a task decomposition method based on shared index and a task decomposition method based on independent index are proposed. Using the proposed data partitioning method and task decomposition method, a parallel parallel similar join algorithm based on multi-core is implemented, including self-join and R-S connection. In this paper, the time cost of two different implementations is analyzed, including the cost of index update, index scan and set intersection. From the point of view of theoretical analysis, it is proved that the implementation of data length equilibrium partition and independent index has advantages in execution efficiency. In the experiment part, we use different data sets to verify the efficiency and expansibility of parallel similar connection implementation on multi-core and multi-processor platform. Experimental results show that the proposed data length equalization method and task decomposition method based on independent index can effectively improve the efficiency of parallel similar join execution. Using 16 threads on the 16-core platform to perform parallel similar self-join on DBLP dataset and parallel R-S connection on CiteSeer and DBLP data sets can be completed in one second.
【作者单位】：天津工业大学计算机科学与软件学院;
【基金】：国家自然科学基金(61402329,61373104) 国家留学基金委资助~~
【分类号】：TP311.13

【相似文献】