Design and Implementation of a Dump Module Based on the Hadoop Platform

Published: 2018-03-21 03:39

Topic: Dump　Focus: data processing　Source: Beijing University of Posts and Telecommunications, 2012 master's thesis　Type: degree thesis

[Abstract]: With the rapid growth of the Internet industry, user-related information and data are expanding on a massive scale, and exporting, analyzing, and processing the valuable portion of that data has become a challenge for major companies. Traditionally, data export is done with a single-machine Dump: joins across database tables are usually performed on the Server side, while the Client side further analyzes and processes the retrieved data. As the business develops and data grows explosively, however, this single-machine approach can no longer meet the system's performance requirements. To some extent it has become a bottleneck that constrains business growth, and a more suitable architecture is needed to replace it.

Hadoop is a distributed-system infrastructure developed by the Apache Foundation. It is a software framework for distributed processing of large volumes of data that lets users develop distributed programs without understanding the underlying distributed details, making full use of a cluster's power for high-speed computation and storage. Hadoop implements a distributed file system, HDFS. HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware, and it provides high-throughput access to application data, which suits applications with very large data sets.

From an enterprise-application perspective, and taking the Taobao Zhitongche (直通车) advertising system as the business background, this thesis analyzes the problems and bottlenecks that data currently faces during Dump and subsequent processing, and summarizes the key technical points of developing such programs on the Hadoop platform. On that basis, and according to the business requirements, the overall task is decomposed into several major functional modules, a corresponding Hadoop-based solution is given for each, and the program structure is designed and all of the code implemented. The result not only resolves, at the architectural level, the problems faced by the single-machine Dump, but also gives the whole system better stability, higher scalability, and easier maintenance, so that it can keep pace with rapid business development and large-scale data growth for a considerable time to come. Finally, the thesis systematically analyzes the underlying working mechanisms and operating principles of the Hadoop platform and tunes the relevant parameters of the production system, which effectively reduces the load on the machines and achieves good results.
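[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP311.52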
