Research and Implementation of Spark-Based Optimization of a DSP Data Warehouse

Published: 2018-05-27 22:43

Topic: data warehouse + Spark; Source: Jilin University, master's thesis, 2017

[Abstract]: Computer and information technology is developing rapidly, and industries riding the "Internet Plus" trend generate large volumes of Internet data across many fields. Business operations produce data, and the data warehouse turns that data into support for decisions at every level of the enterprise. As the link between enterprise growth and data tightens, there is a pressing need for new big data processing optimization methods and technologies. Hadoop and Spark are currently the most widely used big data computing frameworks, and most companies can meet their business needs by adopting them. Against this background, this thesis proposes a Spark-based data warehouse optimization design for the DSP (Demand-Side Platform) advertising industry. After analyzing each stage of the data warehouse pipeline, it optimizes three aspects in turn to improve end-to-end processing efficiency: the framework workflow, data storage, and data processing. In the framework, Kafka, a high-throughput distributed publish/subscribe messaging system, is inserted between the data sources and Spark on Hadoop, so that online and offline messages are handled quickly and in a unified way. To address slow data storage, Spark Streaming reads and writes data in an open-source storage layer that combines HBase and HDFS (Hadoop Distributed File System), using a partitioned join approach to speed up data access. For the data processing stage, where data skew occurs, a sampling aggregation algorithm is used to mitigate the problem of a few oversized tasks, caused by unevenly distributed data, slowing down the completion of the whole job.
Experimental comparison shows that for ordinary (non-skewed) data, the optimized data warehouse takes more than 10% less time overall than the traditional data warehouse workflow, while also improving system throughput and storage performance. For skewed data, the proposed sampling aggregation algorithm aggregates data faster while keeping the processing results accurate, which in turn improves the overall execution efficiency of the data warehouse.
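The abstract does not spell out the sampling aggregation algorithm, so the following is only a minimal sketch of one common way to combine sampling with aggregation under data skew, not the thesis's actual method. It is written in plain Python rather than Spark so that it runs on its own. The function names (`find_hot_keys`, `skew_aware_sum`), the parameters (`sample_fraction`, `hot_share`, `num_salts`), and the `ad_…` keys are all hypothetical. The idea: sample the records to estimate which keys dominate, spread each dominant key over several salted sub-keys for a first partial aggregation, then merge the partial results per original key.

```python
import random
from collections import Counter, defaultdict


def find_hot_keys(records, sample_fraction=0.1, hot_share=0.2, seed=0):
    """Sample (key, value) records and return the keys whose share of the
    sample exceeds hot_share. All thresholds here are illustrative."""
    rng = random.Random(seed)
    sample = [key for key, _ in records if rng.random() < sample_fraction]
    if not sample:
        return set()
    counts = Counter(sample)
    return {key for key, n in counts.items() if n / len(sample) > hot_share}


def skew_aware_sum(records, num_salts=4, **sampling_options):
    """Sum values per key in two stages, splitting hot keys across salts."""
    hot = find_hot_keys(records, **sampling_options)
    rng = random.Random(1)

    # Stage 1: partial sums. A hot key is spread over num_salts sub-keys,
    # so in a distributed setting no single task would receive all its rows.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts) if key in hot else 0
        partial[(key, salt)] += value

    # Stage 2: drop the salt and merge the partial sums per original key.
    total = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        total[key] += subtotal
    return dict(total)


# Hypothetical skewed input: one ad key accounts for 90% of the records.
records = [("ad_1", 1)] * 9000 + [(f"ad_{i}", 1) for i in range(2, 1002)]
result = skew_aware_sum(records)
```

Because the salt is removed before the final merge, the per-key totals are the same as a direct aggregation would give; only the distribution of intermediate work changes. In Spark, the two stages would typically map to two successive key-based aggregations, the first on the salted key and the second on the original key.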
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

[References]