Research on Key Techniques for Job Performance Optimization in Distributed Data Processing Systems
Published: 2019-05-29 22:24
【Abstract】: With the growth of data volumes across industries, distributed data processing technology is widely used for data analysis. MapReduce, with its ease of use, simple programming model, strong fault tolerance, and high cost-effectiveness, has become the mainstream distributed processing model and is widely applied to large-scale data analysis in many industries. As data processing demands grow, however, several shortcomings of MapReduce have become apparent. The most common are its large number of configuration parameters, imperfect task scheduling strategies, low effectiveness of data locality, and unreasonable resource-slot allocation, all of which make MapReduce jobs execute inefficiently. MapReduce job performance tuning improves job performance by addressing these shortcomings, greatly reducing job execution time, so research on MapReduce job performance optimization has significant scientific and practical value. This thesis studies several key problems in MapReduce job performance optimization. Building on prior work on job performance optimization, it establishes an I/O cost function to show how configuration parameters affect job execution time, proposes a feature-selection method to identify the parameters that most influence execution time, and further improves execution time by optimizing data locality, data replica placement, and task scheduling. The main contributions are as follows:

(1) By constructing functions for the number of I/O bytes read and written and the number of I/O requests, it is shown that certain configuration parameters directly affect MapReduce job execution time, and it is verified that different configuration parameters affect execution time to different degrees.

(2) A clustering feature selection algorithm based on a kernel-function penalty (IK-means) is proposed, addressing the difficulty platform administrators face in configuring MapReduce's many parameters. To accurately judge each parameter's influence, IK-means replaces the conventional Gaussian kernel with an anisotropic Gaussian kernel, whose per-direction parameters (also called kernel widths) reflect the importance of each feature. A gradient descent algorithm is used to minimize the anisotropic kernel's width vector so that clustering on the selected features comes as close as possible to clustering on the original features, thereby achieving feature selection. To address the sensitivity of clustering-based feature selection to the choice of initial points, a globally aware local-density initial-point selection algorithm is also proposed. Theoretical analysis and experimental results show that the proposed feature selection algorithm performs well for configuration-parameter selection.

(3) A data locality algorithm based on minimum-weight bipartite matching is proposed, solving the problem of satisfying data locality for multiple tasks simultaneously in MapReduce. A dynamic adaptive replication algorithm is also proposed, which identifies hot data to decide which blocks to replicate in dynamic replica placement. Theoretical analysis and experimental results show that the dynamic adaptive replication algorithm effectively supports the minimum-weight bipartite matching algorithm and improves the effectiveness of multi-task data locality.

(4) A task scheduling algorithm that meets user time requirements while optimizing resource use is proposed. It uses the time and resource-consumption information in historical job profiles to estimate a new job's execution time and slot consumption, thereby meeting user deadlines while also reducing the excessive resource consumption of running MapReduce jobs. The algorithm's effectiveness is verified both by theoretical analysis of the job execution process and by experimental results showing its advantages in job execution time and slot-resource consumption.
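To make contribution (1) concrete, the sketch below models map-side I/O cost as a function of the sort-buffer configuration. The parameter names echo Hadoop's `io.sort.mb` and `io.sort.spill.percent`, but the formulas are an illustrative simplification, not the thesis's actual cost function: it only shows why a buffer-size parameter changes both the number of spills (I/O requests) and the total bytes written and re-read.

```python
import math

# Hypothetical, simplified I/O cost model for one map task: spill count and
# total spilled bytes as functions of the sort-buffer configuration.
# This is a sketch of the idea that configuration parameters directly drive
# I/O volume, not the exact cost function derived in the thesis.

def map_side_io_cost(map_output_bytes, sort_buffer_mb, spill_percent=0.8):
    """Return (spill_count, total_io_bytes) for one map task."""
    usable_buffer = sort_buffer_mb * 1024 * 1024 * spill_percent
    if map_output_bytes <= usable_buffer:
        # Everything fits: a single final spill, no re-reading.
        return 1, map_output_bytes
    spill_count = math.ceil(map_output_bytes / usable_buffer)
    # Each intermediate spill file is written once and re-read during the
    # final merge, so extra spills translate directly into extra I/O bytes.
    merge_rereads = (spill_count - 1) * usable_buffer
    return spill_count, map_output_bytes + merge_rereads
```

Under this model, enlarging the buffer for a fixed map output strictly reduces both the request count and the byte count, which is the qualitative effect the I/O cost functions in (1) formalize.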
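The anisotropic kernel at the heart of contribution (2) can be sketched as follows. Each feature gets its own kernel width; a large width makes that feature nearly irrelevant to the similarity, so the learned width vector doubles as a feature-importance ranking. The widths here are given directly, and the `feature_importance` normalization is an assumption of this sketch; the thesis learns the widths by gradient descent, which is omitted.

```python
import numpy as np

# Anisotropic Gaussian kernel: one width (sigma_d) per feature dimension,
# so each dimension contributes differently to the similarity.

def anisotropic_gaussian_kernel(x, y, widths):
    """k(x, y) = exp(-sum_d (x_d - y_d)^2 / (2 * sigma_d^2))."""
    x, y, widths = map(np.asarray, (x, y, widths))
    return float(np.exp(-np.sum((x - y) ** 2 / (2.0 * widths ** 2))))

def feature_importance(widths):
    """Smaller width => distances in that feature matter more => more
    important. Normalized inverse widths (one possible convention)."""
    inv = 1.0 / np.asarray(widths, dtype=float)
    return inv / inv.sum()
```

With widths `[1, 100]`, a unit difference in the first feature drops the kernel value far more than the same difference in the second, which is exactly the per-direction sensitivity IK-means exploits to rank configuration parameters.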
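Contribution (3) treats task placement as minimum-weight bipartite matching between pending tasks and free slots. The brute-force sketch below uses a hypothetical locality cost (0 = data-local node, 1 = same rack, 2 = remote) and enumerates permutations, which is only viable for tiny instances; a real scheduler would use a proper matching algorithm such as the Hungarian method. The point it illustrates is that greedy per-task assignment can be suboptimal when several tasks want the same node, while a global matching is not.

```python
from itertools import permutations

# Minimum-weight bipartite matching of tasks to node slots.
# cost[t][n] = locality cost of running task t on node n
# (e.g. 0 data-local, 1 same rack, 2 remote) -- an assumed cost scheme.

def best_assignment(cost):
    """Return (assignment, total_cost) where assignment[t] is the node
    chosen for task t. Brute force over permutations (square matrix)."""
    n = len(cost)
    best, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[t][perm[t]] for t in range(n))
        if total < best_total:
            best, best_total = perm, total
    return best, best_total
```

For `[[0, 1], [0, 2]]` both tasks prefer node 0; a greedy scheduler serving task 0 first pays 0 + 2 = 2, while the matching assigns task 0 to node 1 for a total of 1, demonstrating why multi-task locality needs a global solution.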
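Finally, contribution (4) estimates a new job's execution time and slot demand from historical job profiles. The sketch below assumes a simple linear model, runtime proportional to input size divided by slot count, calibrated from past runs; the thesis's estimator is more detailed, so both the model and the `(input_bytes, slots, runtime)` profile format are assumptions of this illustration.

```python
# Deadline-aware slot allocation from historical job profiles.
# Assumed linear model: runtime ~= rate * input_bytes / slots, where rate is
# calibrated from past runs of similar jobs.

def estimate_rate(history):
    """history: list of (input_bytes, slots, runtime_seconds) tuples.
    Returns average seconds per byte per slot."""
    rates = [t * s / b for b, s, t in history]
    return sum(rates) / len(rates)

def min_slots_for_deadline(history, input_bytes, deadline_seconds, max_slots):
    """Smallest slot count predicted to meet the deadline, or None if no
    allocation within the budget suffices."""
    rate = estimate_rate(history)
    for slots in range(1, max_slots + 1):
        if rate * input_bytes / slots <= deadline_seconds:
            return slots
    return None
```

Returning the smallest feasible allocation rather than the fastest one is what lets the scheduler meet the user's deadline while cutting slot consumption, the dual goal stated in (4).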
【Degree-granting institution】: Chongqing University
【Degree level】: Doctorate
【Year conferred】: 2016
【CLC number】: TP311.13
Article ID: 2488270
Link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2488270.html