Optimization of Scheduling Algorithms and Download Mechanisms on the Hadoop Platform

Published: 2019-05-10 00:16

【Abstract】: With the rapid development of Internet technology, data volumes are growing explosively. As the carrier of information, data plays a pivotal role in informatization. Difficult management of massive data, high storage cost, low reliability, and low security are major problems today. A growing number of enterprises are entering the cloud computing field and using cloud computing for distributed computation and data management. Cloud computing services offer high reliability, easy scalability, large storage capacity, and fast processing, so research on cloud computing service systems has become a trend in the further development of IT. With the goal of improving data processing speed on Hadoop, a platform for implementing cloud computing, this thesis studies the internal mechanisms of MapReduce and HDFS in depth. To address the heterogeneity of Hadoop's runtime environment, and to let Hadoop assign tasks according to the computing capacity of each compute node, the thesis proposes an improved adaptive load-adjustment scheduling algorithm (SALS). The algorithm couples Hadoop scheduling with the current system load level to make scheduling adaptive, and it improves Hadoop's original speculative execution algorithm. The new algorithm identifies stragglers, the slow tasks that lengthen system response time, more accurately; the hit rate of straggler detection rises substantially, which in turn improves the responsiveness of the whole system. To address the low efficiency of internal data download in HDFS and the load imbalance that may arise, the thesis proposes a parallel download algorithm for distributed files. The algorithm optimizes both whole-file download efficiency and per-block download efficiency, and on that basis introduces the multi-threading idea from P2P to raise download efficiency. Building on traditional parallel algorithms, it introduces a new speed prediction function that uses the average historical download speed and the current speed to predict future download speed more accurately. Experiments show that, compared with Hadoop's native download mechanism, the algorithm noticeably improves system performance and serves user download requests sooner.
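The abstract does not give the form of the speed prediction function, so the following is only a minimal sketch of the general idea: blend the average historical download speed with the current speed, then split a file's blocks across source nodes in proportion to the predicted speeds. The names `predict_speed` and `split_blocks`, the weight `alpha`, and the node labels are hypothetical and are not taken from the thesis or from any Hadoop API.

```python
from typing import Dict, List


def predict_speed(history: List[float], current: float, alpha: float = 0.5) -> float:
    """Blend the mean of past speed samples with the latest sample.

    alpha is a hypothetical weight on the current speed (0 <= alpha <= 1);
    the thesis's actual prediction function is not stated in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not history:
        # No history yet: the current sample is the only evidence available.
        return current
    average = sum(history) / len(history)
    return alpha * current + (1.0 - alpha) * average


def split_blocks(total_blocks: int, predicted: Dict[str, float]) -> Dict[str, int]:
    """Assign block counts to source nodes in proportion to predicted speed."""
    total_speed = sum(predicted.values())
    if total_speed <= 0:
        raise ValueError("at least one source needs a positive predicted speed")
    shares = {
        node: int(total_blocks * speed / total_speed)
        for node, speed in predicted.items()
    }
    # Rounding down can leave blocks unassigned; give them to the fastest sources.
    leftover = total_blocks - sum(shares.values())
    for node in sorted(predicted, key=predicted.get, reverse=True)[:leftover]:
        shares[node] += 1
    return shares


if __name__ == "__main__":
    speeds = {
        "dn1": predict_speed([2.0, 4.0], 6.0),
        "dn2": predict_speed([1.0, 1.0], 2.0),
    }
    print(speeds)
    print(split_blocks(10, speeds))
```

A larger `alpha` makes the prediction react faster to a node that has just slowed down, while a smaller value smooths out short bursts; which trade-off the thesis chose is not stated in the abstract.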
【Degree-granting institution】: Central South University
【Degree level】: Master's
【Year conferred】: 2012
【Classification number】: TP3
