Research on Multi-Dimensional Data Indexing for Cloud Computing

Published: 2018-08-20 10:07

[Abstract]: The widespread adoption of cloud computing has led to explosive data growth and poses new challenges for traditional data management techniques. Existing cloud storage systems generally store and retrieve data through distributed hash tables. This key-value model is efficient for single-dimensional queries but offers insufficient support for multi-dimensional queries. When a user submits a multi-dimensional query over several attribute columns, the lack of an effective secondary index means that a MapReduce job must scan the entire data set, which lowers query efficiency. For this reason, auxiliary indexes for cloud storage have become an active research topic in recent years, with results appearing in top international database conferences and journals. This dissertation studies multi-dimensional data indexing in cloud computing environments. The work covers three aspects: multi-dimensional data indexing in cloud storage systems, two-layer multi-dimensional data indexing based on a master-slave architecture, and multi-dimensional data indexing that supports dynamic dimension extension in a fully distributed environment. The main contributions are summarized as follows:
1. Existing cloud storage systems mainly support single-key indexes and lack an effective multi-dimensional index, so multi-dimensional queries are inefficient. To address this, the dissertation proposes CloudUB, a new multi-dimensional cloud data indexing scheme based on the UB-tree. The scheme first uses the Z-curve to reduce the dimensionality of the multi-dimensional space, then partitions the space along the Z-curve into Z-regions and organizes the Z-region information in a B+ tree, yielding an improved UB-tree index. When executing a multi-dimensional query, CloudUB uses the Z-regions to filter out parts of the data space that cannot contain query results, which improves query efficiency. In addition, an HBase-based index construction and maintenance mechanism is designed, together with corresponding real-time and offline index construction algorithms. The mechanism stores the leaf nodes of the B+ tree built on Z-curve dimensionality reduction in HBase, turning lookups in the original multi-dimensional space into key-value queries that existing cloud storage systems already support, and thereby allowing MapReduce to access the index table with high concurrency. Finally, a multi-dimensional search algorithm for CloudUB is designed and its efficiency is analyzed. Tests on Hadoop 2.2 with data on the order of 10 million records show that CloudUB supports flexible and efficient real-time index construction and significantly improves multi-dimensional query efficiency.
2. Based on an in-depth study of how data is managed in cloud computing systems, the dissertation proposes KD-R, a two-layer multi-dimensional data index that matches the master-slave management style of such systems. The scheme builds an R-tree index over the local data on every data server; together these local R-tree indexes form the lower layer of the two-layer index. Part of the node information of each R-tree is then published to the global server layer, where a unified KD-tree index is built. To decide which local index nodes should be published to the global index, the dissertation designs an adaptive node publishing algorithm and a cost model for selecting the nodes to publish; the cost model estimates the indexing cost of a local index node. The index system periodically checks the index nodes on the local data servers against the cost model and then uses the adaptive node publishing algorithm to adjust the set of published local index nodes, dynamically optimizing the KD-R index. Experimental results show that the multi-dimensional query algorithm based on the KD-R index achieves high memory utilization and query efficiency and demonstrates good usability.
3. User requirements in cloud computing systems are elastic, and the set of query dimensions may be extended dynamically. To address this, the dissertation proposes CB-index, a multi-dimensional cloud data index based on the Chord overlay network and partitioned bitmaps. The scheme uses the Chord overlay network to build the global index, which avoids the bottleneck that the global server tends to become in a master-slave structure and yields a fully distributed two-layer index architecture. The dissertation also designs a partitioned bitmap encoding mechanism, builds the local data index on each data server from partitioned bitmaps, and thereby combines the local index nodes with the Chord overlay network. Exploiting the fact that the prefix of a partitioned bitmap code is extensible, a dynamic index dimension extension algorithm is designed that extends the dimensions without completely rebuilding the index structure. In addition, an adaptive index node adjustment algorithm, a multi-dimensional query algorithm, and an index maintenance algorithm are designed. Experimental results show that CB-index achieves high multi-dimensional query efficiency, supports flexible index dimension extension, and can adapt to users' dynamic query requirements in cloud computing environments.
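As an illustration of the Z-curve dimensionality reduction that CloudUB builds on, the following minimal Python sketch interleaves coordinate bits into a single Z-order key and answers a box query with a key-range scan followed by an exact filter. It is not the dissertation's algorithm: the function names, the fixed `BITS` width, and the sorted in-memory list standing in for a key-ordered store such as an HBase table are all assumptions made for illustration. Unlike a UB-tree, it does not skip Z-regions that lie inside the key range but outside the query box.

```python
from bisect import bisect_left, bisect_right

BITS = 8  # assumed number of bits per coordinate (hypothetical choice)


def z_encode(point, bits=BITS):
    """Interleave the bits of every coordinate into one Z-order key."""
    dims = len(point)
    key = 0
    for bit in range(bits):
        for d, coord in enumerate(point):
            key |= ((coord >> bit) & 1) << (bit * dims + d)
    return key


def z_decode(key, dims, bits=BITS):
    """Inverse of z_encode: recover the coordinates from a Z-order key."""
    point = [0] * dims
    for bit in range(bits):
        for d in range(dims):
            point[d] |= ((key >> (bit * dims + d)) & 1) << bit
    return tuple(point)


def build_index(points):
    """Sorted (z_key, point) pairs; a stand-in for a key-ordered table."""
    return sorted((z_encode(p), p) for p in points)


def range_query(index, lo, hi):
    """Return all points p with lo <= p <= hi in every dimension.

    Bit interleaving is monotone in each coordinate, so every point in
    the box has a key between z_encode(lo) and z_encode(hi). The scan of
    that key range may include points outside the box, so each candidate
    is checked exactly afterwards.
    """
    keys = [k for k, _ in index]
    start = bisect_left(keys, z_encode(lo))
    end = bisect_right(keys, z_encode(hi))
    return [p for _, p in index[start:end]
            if all(l <= c <= h for l, c, h in zip(lo, p, hi))]


if __name__ == "__main__":
    grid = [(x, y) for x in range(8) for y in range(8)]
    idx = build_index(grid)
    print(range_query(idx, (2, 2), (4, 5)))
```

The key-range scan is the step a key-value store can serve directly; the exact filter afterwards is what a Z-region-aware index would reduce by pruning regions that cannot intersect the query box.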
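[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP311.13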

[Similar literature]