Research on Distributed OLAP Semantic Caching Algorithms

Published: 2018-05-23 18:21

Topics: closed cube + Spark; Source: Kunming University of Science and Technology, 2017 master's thesis

[Abstract]: A data cube built by modeling a data warehouse can be compressed by deleting the non-closed cells among its tuples and then partitioned into layers, which yields a layered closed cube. Spark is a fast, general-purpose, in-memory framework for parallel computation on big data. Building on the layered closed cube and on Spark, this thesis designs and implements two distributed OLAP query algorithms: SLCCQuery and its optimized variant SLCC_LayeredQuery. Experiments on data sets generated with different parameters verify the effectiveness of the distributed OLAP query algorithm in the Spark environment and the relative efficiency of the optimized variant. To further improve the efficiency of distributed OLAP queries on Spark, the thesis designs a new distributed OLAP semantic cache algorithm. Instead of storing individual data tuples, the algorithm stores the upper and lower bounds of equivalence classes to represent the tuples in a query result set. The cache entries, together with the semantic relations between them, form an algebraic lattice, and pruning by these semantic relations at query time further narrows the search range within the cache. The final experiments verify the effectiveness and relative efficiency of the distributed OLAP semantic cache algorithm. The main contributions are as follows: (1) the data cube is compressed by removing its non-closed cells and partitioned into layers to form a layered closed cube, and on top of Spark two distributed OLAP query algorithms are designed and implemented: SLCCQuery and its optimized variant SLCC_LayeredQuery; (2) to meet the caching needs of the distributed OLAP query algorithms, and because common query caching techniques such as page caching and tuple caching do not exploit the semantic relations among cached query items, a new OLAP query caching technique, the semantic OLAP cache, is proposed; (3) based on the semantic OLAP cache model and on Spark, two distributed OLAP cache algorithms for the Spark environment are designed and combined with different cache replacement policies, and experiments verify the effectiveness and relative efficiency of the proposed distributed OLAP semantic cache algorithms.
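The idea of caching the bounds of an equivalence class rather than individual tuples can be illustrated with a small sketch. This is not the thesis's algorithm: the names `Cell`, `CacheEntry`, `SemanticCache`, and `generalizes` are hypothetical, each class is assumed to have a single most-general bound (real classes may have several), the lookup is a plain linear scan rather than lattice-based pruning, and nothing here runs on Spark.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A cell holds one value per dimension; "*" stands for "all values" (rolled up).
Cell = Tuple[str, ...]
ALL = "*"


def generalizes(a: Cell, b: Cell) -> bool:
    """Return True if a equals b or rolls b up (a has "*" wherever it differs)."""
    return len(a) == len(b) and all(x == ALL or x == y for x, y in zip(a, b))


@dataclass(frozen=True)
class CacheEntry:
    general: Cell   # most general cell of the class (assumed unique here)
    specific: Cell  # most specific (closed) cell of the class
    value: float    # aggregate shared by every cell in the class

    def covers(self, query: Cell) -> bool:
        # The query belongs to the class if it lies between the two bounds.
        return generalizes(self.general, query) and generalizes(query, self.specific)


class SemanticCache:
    """Hypothetical cache: one entry answers every cell between its bounds."""

    def __init__(self) -> None:
        self.entries: List[CacheEntry] = []

    def put(self, entry: CacheEntry) -> None:
        self.entries.append(entry)

    def lookup(self, query: Cell) -> Optional[float]:
        # Linear scan for clarity; no lattice index over entries is built.
        for entry in self.entries:
            if entry.covers(query):
                return entry.value
        return None  # cache miss: the caller would query the cube instead


# Dimensions assumed for illustration: (store, product, season).
cache = SemanticCache()
cache.put(CacheEntry(general=("S1", ALL, ALL),
                     specific=("S1", "P1", "spring"),
                     value=6.0))

hit = cache.lookup(("S1", "P1", ALL))   # lies between the bounds
miss = cache.lookup(("S2", ALL, ALL))   # outside the class
```

One entry here answers four distinct cells, `("S1", "*", "*")`, `("S1", "P1", "*")`, `("S1", "*", "spring")`, and `("S1", "P1", "spring")`, which is the space saving that storing bounds instead of tuples is meant to provide.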
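[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13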
