Research and Improvement of MapReduce in Scientific Computing

Published: 2018-06-23 17:24

Topic: MapReduce + cloud computing; Reference: Anhui University, master's thesis, 2013

[Abstract]: With the rapid growth of heterogeneous data, cloud computing has emerged in response. MapReduce, the programming model of cloud computing, has likewise attracted wide attention, especially in academia. To address problems such as data overwriting and the storage of intermediate data, researchers have proposed many improvements and built their own programming models, such as Hadoop, Twister and Haloop. To support iterative algorithms, Haloop adds a Loop Control mechanism, implemented mainly through two additional functions, ADDMap and ADDReduce, whose purpose is to increase the number of iterations. Twister has a corresponding loop-control mechanism. Likewise, to better execute iterative algorithms, this thesis keeps the original interfaces and functions and adds a parameter M to the Map, Reduce, ADDMap and ADDReduce functions. M distinguishes the four classes of algorithms in scientific computing: M = 1 denotes the first class, M = 2 the second, M = 3 the third, and M = 4 the fourth. Because the third and fourth classes are both iterative, the functions and interfaces they frequently use are written as adapters, and during experiments developers can fill in the function bodies as needed. To keep the data safe, variables are declared as protected in the experiments. Data that changes little is kept in a buffer pool, so that it can be read and written on the local system of the Slave node rather than through the Master node, which both reduces the Master node's workload and improves running efficiency.

In view of the shortcomings of existing scheduling algorithms, an improved algorithm is proposed. It introduces additional parameters, including the computation cost, the task deadline and the processing speed of the client machine, and maintains two queues: a computing resource queue and a deadline queue. Task priority in the computing resource queue is determined by the computation cost, which is multiplied by a weight, Weight, whose value is set by the parameter M added to the Map, Reduce, ADDMap and ADDReduce functions: Weight equals M, so it is 1, 2, 3 or 4 when M is 1, 2, 3 or 4 respectively. Priority in the deadline queue is determined by the deadline. Every task in the computing resource queue has priority over every task in the deadline queue; however, if the deadline queue contains a task whose deadline equals 0, that task is inserted directly at the head of the computing resource queue. This ensures that large tasks execute efficiently while small tasks are still served. The thesis reports that the improved algorithm performs well. Finally, examples are given and experiments are carried out using Hadoop.
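The two-queue scheduling scheme in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `Task`, `TwoQueueScheduler`, `base_cost`, `submit_resource` and `submit_deadline` are hypothetical, and the abstract does not say how tasks are assigned to a queue or in which direction cost is ordered. The sketch assumes the caller picks the queue, that a larger weighted cost is served first, and that an earlier deadline is served first. Only Weight = M and the two queue rules come from the abstract.

```python
import heapq
from collections import deque
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Task:
    name: str
    m: int = 1                      # algorithm class M (1-4), from the abstract
    base_cost: float = 0.0          # hypothetical unweighted computation cost
    deadline: Optional[int] = None  # hypothetical time units left; 0 = due now

    def __post_init__(self) -> None:
        if self.m not in (1, 2, 3, 4):
            raise ValueError("M must be 1, 2, 3 or 4")

    def weighted_cost(self) -> float:
        # The abstract sets Weight equal to M.
        return self.base_cost * self.m


class TwoQueueScheduler:
    def __init__(self) -> None:
        self._head = deque()   # tasks promoted to the head of the resource queue
        self._resource = []    # heap of (-weighted_cost, seq, task)
        self._deadline = []    # heap of (deadline, seq, task)
        self._seq = count()    # tie-breaker so Task objects are never compared

    def submit_resource(self, task: Task) -> None:
        # Assumption: a larger weighted cost means a higher priority.
        heapq.heappush(self._resource,
                       (-task.weighted_cost(), next(self._seq), task))

    def submit_deadline(self, task: Task) -> None:
        if task.deadline is None:
            raise ValueError("deadline-queue tasks need a deadline")
        # Assumption: an earlier deadline means a higher priority.
        heapq.heappush(self._deadline, (task.deadline, next(self._seq), task))

    def next_task(self) -> Optional[Task]:
        # Tasks whose deadline has reached 0 jump to the head of the
        # computing resource queue.
        while self._deadline and self._deadline[0][0] <= 0:
            self._head.append(heapq.heappop(self._deadline)[2])
        if self._head:
            return self._head.popleft()
        # Otherwise the resource queue outranks the deadline queue.
        if self._resource:
            return heapq.heappop(self._resource)[2]
        if self._deadline:
            return heapq.heappop(self._deadline)[2]
        return None
```

In this sketch deadlines are fixed at submission time. A real scheduler would also need to decrease them as time passes, which the abstract does not describe.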
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP311.1;TP338.6
