Research on the MapReduce Distributed Parallel Processing Framework in Cloud Computing

Published: 2018-07-10 06:07

Topics: cloud computing + parallel computing; Source: Hubei University of Technology, master's thesis, 2017

[Abstract]: Big data is a general term for the data of varied formats and structures produced in the Internet era, and it is characterized by its extremely large volume. As the Internet penetrates ever more fields, big data is generated faster and faster, showing exponential growth. In recent years, attention has turned to cloud computing, which combines computer technology with Internet technology, as a way to solve the problem of processing big data. Distributed computing is the process of dividing a very large body of engineering data into a number of small blocks, having multiple computers (computing nodes) compute on them separately and upload their results, and finally merging those results into a unified conclusion. Parallel computing refers to dividing an overall computing task into multiple sub-blocks and assigning them to different processors on computing nodes that have parallel processing capability; the processors execute the sub-block tasks in parallel under a scheduling mechanism, with the aim of increasing the scale or the speed of computation. This design uses the Hadoop distributed architecture, which consists mainly of three subprojects: MapReduce (a programming model and software framework for writing parallel programs that rapidly process big data on large computer clusters), HDFS (a distributed file system built on clusters of inexpensive computers), and Hadoop Common (which provides basic supporting functions for the overall architecture). The work focuses on the MapReduce processing framework and builds a distributed system platform that processes data reliably, using distributed parallel operation to speed up processing and thereby demonstrate its advantages when handling large amounts of data in cloud computing applications. Finally, program tests are used to analyze the workflow and characteristics of the MapReduce framework in data processing.
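The split, compute, and merge flow described in the abstract can be illustrated with a small word-count sketch. This is not the thesis's implementation and does not use Hadoop: it is a single-machine illustration in Python, in which a thread pool stands in for cluster nodes. All function names (`split_input`, `map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions introduced here for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_splits):
    """Divide the input into num_splits blocks (some may be empty)."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle_phase(mapped_blocks):
    """Shuffle: group all emitted values by key across every block."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: merge the values collected for each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, num_workers=3):
    """Run the split -> map -> shuffle -> reduce pipeline on a list of lines."""
    blocks = split_input(lines, num_workers)
    # Assumption: threads stand in for computing nodes. A real Hadoop job
    # would schedule the map tasks on separate machines in the cluster.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_phase, blocks))
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    sample = ["big data cloud", "cloud computing", "big data big cluster"]
    print(word_count(sample))
```

Because each block is mapped independently, the result does not depend on the number of workers; only the shuffle and reduce steps need to see the output of every block.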
[Degree-granting institution]: Hubei University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP338

[References]