Distributed Data Cube Computation

Published: 2018-01-29 22:50

Keywords: data cube, distributed, MapReduce, TeraSort　Source: Sun Yat-sen University, master's thesis, 2014　Type: degree thesis

【Abstract】: A data cube is a multidimensional computation model that effectively supports OLAP. It precomputes and stores the GroupBy results for every combination of attributes in a data table, which shortens system response time and improves query efficiency. As data volumes grow rapidly and distributed computing frameworks such as MapReduce become more widely used, combining data cube computation with distributed computing is an inevitable trend. For algebraic measures such as SUM, a straightforward use of the MapReduce framework computes the data cube efficiently. For holistic measures such as DISTINCT, however, a naive combination with MapReduce leads to problems such as load imbalance and excessive intermediate data. MR-Cube, currently the best distributed data cube algorithm, mitigates these problems by partitioning the data and then merging the partial results. Its partitioning is not precise enough, though, and produces unnecessary partitions that add to the cost of the subsequent merge. For the merge step, MR-Cube offers only a set of rules rather than a simple and effective merging method, and its use of the BUC algorithm during merging does not fully exploit the characteristics of the MapReduce framework. To better address load imbalance and excessive intermediate data, this thesis draws on TeraSort and PipeSort and proposes the TeraSortPipeSort-Cube algorithm (hereafter TSP-Cube). Borrowing TeraSort's random-sampling idea, TSP-Cube partitions data according to how frequently values occur. This avoids unnecessary partitioning and works for datasets with any kind of distribution, which effectively resolves the load imbalance problem. TSP-Cube also replaces the BUC algorithm used in MR-Cube with PipeSort, which can fully exploit the characteristics of the MapReduce framework, for the merge computation. For hierarchical datasets, it adopts a simpler, more effective, and more evenly balanced merging scheme based on the attribute characteristics of the data and the properties of PipeSort, thereby addressing the problem of excessive intermediate data. Experiments show that, under both uniform and skewed distributions, TSP-Cube performs better for holistic measure functions and is more general than existing distributed algorithms. The experiments also compare the performance of several algorithms under algebraic measures, yielding guidance on which method suits each type of measure.
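The difference between algebraic and holistic measures that the abstract relies on can be illustrated with a small single-process sketch. It is not MR-Cube or TSP-Cube, and every name in it (`group_keys`, `cube_sum`, `cube_distinct_sets`, the toy table) is a hypothetical chosen for illustration. Two list slices stand in for the inputs of two mappers. Partial SUM results merge by simple addition, whereas a DISTINCT count can only be merged correctly if each partition ships its full set of values, which is one way the intermediate data grows.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # placeholder for a dimension that has been aggregated away


def group_keys(row, dims):
    """Yield one group key per cuboid (every subset of dims) for one row."""
    for r in range(len(dims) + 1):
        for kept in combinations(dims, r):
            yield tuple(row[d] if d in kept else ALL for d in dims)


def cube_sum(rows, dims, measure):
    """Algebraic measure: each group needs only a running total."""
    out = defaultdict(int)
    for row in rows:
        for key in group_keys(row, dims):
            out[key] += row[measure]
    return dict(out)


def cube_distinct_sets(rows, dims, measure):
    """Holistic measure: each group must keep every distinct value seen."""
    out = defaultdict(set)
    for row in rows:
        for key in group_keys(row, dims):
            out[key].add(row[measure])
    return dict(out)


def merge_sums(a, b):
    """Merge two partial SUM cubes by adding totals per group."""
    merged = defaultdict(int)
    for part in (a, b):
        for key, value in part.items():
            merged[key] += value
    return dict(merged)


def merge_sets(a, b):
    """Merge two partial DISTINCT cubes by taking set unions per group."""
    merged = defaultdict(set)
    for part in (a, b):
        for key, values in part.items():
            merged[key] |= values
    return dict(merged)


# Toy table (assumed data, not from the thesis).
rows = [
    {"city": "GZ", "year": 2013, "user": "u1", "sales": 10},
    {"city": "GZ", "year": 2014, "user": "u1", "sales": 20},
    {"city": "SZ", "year": 2014, "user": "u2", "sales": 5},
    {"city": "GZ", "year": 2014, "user": "u3", "sales": 7},
]
dims = ["city", "year"]
part1, part2 = rows[::2], rows[1::2]  # two simulated input splits

# SUM: small per-partition totals are enough to get the exact answer.
sum_cube = merge_sums(cube_sum(part1, dims, "sales"),
                      cube_sum(part2, dims, "sales"))

# DISTINCT: adding per-partition counts would double-count user "u1",
# so the partitions have to exchange whole value sets instead.
sets1 = cube_distinct_sets(part1, dims, "user")
sets2 = cube_distinct_sets(part2, dims, "user")
naive_counts = merge_sums({k: len(v) for k, v in sets1.items()},
                          {k: len(v) for k, v in sets2.items()})
distinct_cube = {k: len(v) for k, v in merge_sets(sets1, sets2).items()}

print(sum_cube[("GZ", ALL)], naive_counts[("GZ", ALL)],
      distinct_cube[("GZ", ALL)])
```

For the group `("GZ", "*")`, the merged SUM is exact, while the naive count of distinct users is too high because `u1` appears in both splits. Only the set-based merge gives the right value, at the cost of carrying every distinct value through the shuffle.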
【Degree-granting institution】: Sun Yat-sen University
【Degree level】: Master's
【Year degree awarded】: 2014
【Classification number】: TP338.8

【Related literature】