Research on GPU-Based Optimization Methods for Complex SQL Queries

Published: 2018-07-24 15:21

[Abstract]: With the development of information technology, the volume of data stored in databases keeps growing, and the data is characterized by large volume, many data types, and low value density. Against this background, database queries have expanded from traditional simple queries over a single dimension to complex queries over multiple dimensions. Complex queries are an important means of analyzing data in a database system and play an important role in practical data analysis and processing. Through query requests, enterprise decision-makers can quickly obtain the information they care about most. Extracting, storing, and analyzing massive data with traditional database analysis methods to obtain results in real time is becoming increasingly difficult, which in turn constrains the decisions of enterprise managers. To speed up multi-dimensional complex queries over large-scale data, this thesis combines the parallel computing power of graphics processing units (GPUs) with the storage characteristics of column-store databases, and proposes a columnar storage model suited to parallel queries together with a strategy for GPU-accelerated parallel querying. The main research contents are as follows: (1) The theory of complex database queries and the GPU parallel computing model are studied, and traditional database query optimization techniques are summarized, with a focus on the storage strategies and compression algorithms of different databases. (2) A physical storage model based on a sparse index is proposed. The model builds on column storage and partitions each column into segments; in line with the characteristics of the GPU, it compresses data with a delta (difference) compression algorithm and uses the GPU's high degree of parallelism to compress the data in parallel. (3) A GPU-based parallel execution algorithm for complex queries is proposed, which optimizes complex queries by combining GPU query primitives. The work concentrates on optimizing range queries and group-by queries, and proposes a strategy for merging the results of group-by queries. A pipeline scheduling strategy is proposed to address the excessive I/O time observed in the experiments, which speeds up query response to some extent. (4) Experiments demonstrate the advantages of the GPU-accelerated compression algorithm and the query acceleration algorithm. The proposed query model is compared with traditional databases on the TPC-H benchmark dataset from the Transaction Processing Performance Council (TPC), and the results show that on large-scale datasets the proposed query model achieves a speedup of 5 to 8 times over existing GPU databases.
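The storage model in item (2) and the range query in item (3) can be illustrated with a small sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the segment size, the function names, and the choice of a per-segment (min, max) pair as the sparse index are all hypothetical, and the code runs serially on the CPU rather than on a GPU. It only shows why the combination is attractive: each segment is encoded and decoded independently of the others, and decoding a delta-encoded segment is a prefix sum, a standard data-parallel primitive.

```python
from itertools import accumulate

SEGMENT_SIZE = 4  # hypothetical segment length; the abstract does not give one


def delta_encode(values):
    """Store the first value, then each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running (prefix) sum."""
    return list(accumulate(deltas))


def build_column(values, segment_size=SEGMENT_SIZE):
    """Split a column into segments, delta-encode each one, and record
    (min, max) per segment as an assumed form of sparse index."""
    segments, index = [], []
    for start in range(0, len(values), segment_size):
        chunk = values[start:start + segment_size]
        segments.append(delta_encode(chunk))
        index.append((min(chunk), max(chunk)))
    return segments, index


def range_count(segments, index, low, high):
    """Count values v with low <= v <= high, decoding only the segments
    whose (min, max) entry overlaps the query range."""
    count = 0
    decoded_segments = 0
    for deltas, (seg_min, seg_max) in zip(segments, index):
        if seg_max < low or seg_min > high:
            continue  # the sparse index rules this segment out
        decoded_segments += 1
        count += sum(low <= v <= high for v in delta_decode(deltas))
    return count, decoded_segments


column = [100, 102, 103, 107, 200, 201, 205, 209, 300, 304, 305, 310]
segments, index = build_column(column)
result, touched = range_count(segments, index, 201, 302)
```

In this sketch the query for values between 201 and 302 skips the first segment entirely because its index entry ends at 107. In a GPU setting, each surviving segment could plausibly be assigned to its own group of threads, but that mapping is not described in the abstract.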
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP311.13; TP333
