Research on the Design and Optimization Techniques of Distributed Data Warehouses in Enterprise Environments

Published: 2018-05-06 20:15

Topic: distributed systems + data warehouse; Reference: Beijing University of Posts and Telecommunications, master's thesis, 2016

[Abstract]: Since the start of the new century, driven by Internet and Internet of Things technologies, the volume of data available to enterprises has grown steadily. Enterprises no longer need data only for daily transaction processing, and many have begun to build large data warehouses to store and analyze the massive data they face. A data warehouse collects user data from different sources and with different structures, then classifies and integrates it by subject, so that analysis of data on the same subject is more targeted and reliable and of greater reference value to managers making decisions. Traditional centralized data warehouses, limited in scalability and performance, can no longer withstand the pressure of processing massive data. The rise of Hadoop has demonstrated the computing power of distributed technology, and data warehouses with a distributed architecture are becoming the direction of future data warehouse systems. In response, this thesis presents analysis and design in three areas: the distributed architecture of the data warehouse, the unified management of metadata, and the combination of data warehouse technology with the open source Hadoop framework. Combining the Hadoop framework, the MySQL database, distributed storage technology, and the Impala parallel query technology, a complete system architecture is designed. The integration of source data, that is, the ETL (Extract-Transform-Load) work, is carried out as MapReduce jobs. For metadata management, the thesis studies the metadata management mechanism of data warehouse systems and the metadata implementation of the Impala query engine, and designs and implements a centralized metadata management module based on MySQL. The system first extracts and transforms the source data through MapReduce jobs, partitions the intermediate results across nodes according to the partitioning method specified by the user, and then imports them in parallel. The system metadata is stored and managed by the MySQL database, used in the form of a library (lib). The storage layer uses an efficient single-machine storage engine so that each storage node can store and scan data efficiently. Queries are served by the Impala parallel query engine, and query and storage share one metadata scheme, which gives unified management of metadata. With this system, enterprise users can manage massive data efficiently and also perform multidimensional analysis on it, providing data support for formulating and adjusting enterprise strategy. Finally, experiments test the import and query performance of the distributed system, and analysis of the results shows that the system is effective for processing enterprise data.
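The partition-then-import step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say which partitioning methods are supported, so hash partitioning on a single column is assumed here, and the names `partition_for`, `split_rows`, and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_nodes: int) -> int:
    """Map a partition key to a storage node index using a stable hash.

    Assumption: hash partitioning; the thesis may support other schemes.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def split_rows(rows, key_column, num_nodes):
    """Group transformed rows by target node so each group can be loaded independently."""
    buckets = defaultdict(list)
    for row in rows:
        node = partition_for(str(row[key_column]), num_nodes)
        buckets[node].append(row)
    return dict(buckets)


# Hypothetical intermediate results produced by the transform stage.
rows = [
    {"user_id": "u1", "amount": 10},
    {"user_id": "u2", "amount": 25},
    {"user_id": "u1", "amount": 7},
    {"user_id": "u3", "amount": 3},
]
buckets = split_rows(rows, "user_id", num_nodes=3)
```

Because the hash is stable, all rows with the same key land on the same node, so each bucket could be handed to a separate loader and imported in parallel.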
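[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13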
