Research on a Genetic-Algorithm-Based MapReduce Architecture for Distributed Data Mining

Published: 2018-09-08 17:26

[Abstract]: In recent years, the rapid development of information technology has produced, directly or indirectly, vast quantities of data, which poses new challenges for traditional data mining algorithms. Improving the generality and performance of these algorithms on massive data has become an active research topic. To address this, researchers have combined traditional data mining algorithms with emerging technologies such as cloud computing platforms, using distributed computing to improve performance, with good results. However, data mining algorithms are numerous and each one needs its own implementation pattern, so no general architecture both accommodates their diversity and improves their performance. Building on prior work, this thesis proposes a distributed data mining MapReduce architecture based on genetic algorithms, which aims to let users handle data mining algorithms in a more general way while improving their performance. MapReduce, one element of the architecture, provides the distributed computing capability; the genetic algorithm, the other element, provides global search and optimization by simulating population evolution to search for an optimal solution. Users only need to implement the genetic algorithm and need not deal with parallelization. The main contribution of this thesis is the architecture itself. It is divided into a core layer and a user layer: the core layer encapsulates the MapReduce operations, and the user layer exposes extension interfaces through which users implement a genetic algorithm for their specific problem, so that data mining algorithms can be applied effectively to massive data.
The architecture comprises six components. The Driver component is the main part of the framework; it handles user interaction and launches Jobs on the cluster. The Generator component calls the genetic algorithm implemented in the user layer and works with the Driver to launch the Jobs that evolve the population. The Terminator component checks during generation whether the termination condition has been met. The Initialiser component initializes the population and is optional. The Migrator component carries out the population migration strategy, which is implemented in the user layer. Finally, the SolutionFilter component selects the individuals that satisfy the given conditions. The components cooperate to deliver the architecture's functionality. Three algorithms are used to evaluate the architecture. First, a genetic algorithm for K-Medoids is designed and implemented, with clustering accuracy as the individual fitness value and MapReduce used to strengthen the clustering computation; the experiments show good clustering results. Second, a genetic algorithm for the Traveling Salesman Problem (TSP) is designed and implemented, with the reciprocal of the total distance an individual travels between cities as the fitness function, so shorter tours have higher fitness; the experimental results show that the TSP algorithm running in the architecture handles large data effectively and finds the optimal solution faster than comparable algorithms. Finally, a genetic algorithm for Feature Subset Selection (FSS) is designed and implemented, with the classification accuracy of the selected features as the fitness value; the results show that the FSS algorithm running in the architecture converges faster and achieves higher accuracy.
In summary, the genetic-algorithm-based distributed data mining MapReduce architecture proposed in this thesis performs well when running data mining algorithms on massive data. With a problem-specific genetic algorithm implementation, it uses distributed computing to improve performance and the global search and optimization capability of genetic algorithms to find the optimal solution quickly. The study indicates that the architecture improves both the results and the performance of data mining algorithms on massive data.
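The division of work among the six components can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the component interfaces, so every function name, parameter, and the city data below are hypothetical, and a plain local loop stands in for the MapReduce Jobs that the Driver would launch on a cluster. The user layer here is a toy TSP genetic algorithm whose fitness is the reciprocal of the tour length, as the abstract describes.

```python
import math
import random

# Hypothetical user layer: a GA for a tiny TSP instance (illustrative data).
CITIES = [(0, 0), (0, 3), (4, 3), (4, 0), (2, 5)]

def tour_length(tour):
    """Length of the closed tour visiting the cities in the given order."""
    return sum(math.dist(CITIES[a], CITIES[b])
               for a, b in zip(tour, tour[1:] + tour[:1]))

def fitness(tour):
    """Reciprocal of the tour length, so shorter tours score higher."""
    return 1.0 / tour_length(tour)

def initialise(rng, islands, size):
    """Initialiser: random permutations split into island sub-populations."""
    n = len(CITIES)
    return [[rng.sample(range(n), n) for _ in range(size)]
            for _ in range(islands)]

def mutate(rng, tour):
    """Swap two cities; the result is always a valid permutation."""
    i, j = rng.sample(range(len(tour)), 2)
    child = tour[:]
    child[i], child[j] = child[j], child[i]
    return child

def generate(rng, island):
    """Generator: one generation on one island (the per-partition work)."""
    children = [mutate(rng, t) for t in island]
    # Elitist survival: keep the fittest of parents and children.
    return sorted(island + children, key=fitness, reverse=True)[:len(island)]

def migrate(islands):
    """Migrator: ring migration; each island's best replaces the worst
    individual of the next island."""
    bests = [max(isl, key=fitness) for isl in islands]
    for k, isl in enumerate(islands):
        worst = min(range(len(isl)), key=lambda i: fitness(isl[i]))
        isl[worst] = bests[k - 1][:]
    return islands

def terminated(generation, max_generations):
    """Terminator: a fixed generation budget (an assumed criterion)."""
    return generation >= max_generations

def solution_filter(islands, min_fitness):
    """SolutionFilter: keep individuals that meet the fitness threshold."""
    return [t for isl in islands for t in isl if fitness(t) >= min_fitness]

def drive(seed=0, islands=3, size=6, max_generations=30):
    """Driver: wires the components together. Each pass of the loop stands
    in for one Job that would run on the cluster."""
    rng = random.Random(seed)
    pops = initialise(rng, islands, size)
    generation = 0
    while not terminated(generation, max_generations):
        pops = [generate(rng, isl) for isl in pops]
        pops = migrate(pops)
        generation += 1
    best = max((t for isl in pops for t in isl), key=fitness)
    return best, solution_filter(pops, fitness(best))
```

Calling `best, solutions = drive()` returns the fittest tour found and the individuals that match its fitness. In this sketch only `fitness`, `mutate`, and `migrate` are problem-specific, which mirrors the abstract's claim that users implement the genetic algorithm while the core layer handles distribution.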
[Degree-Granting Institution]: Tianjin University
[Degree Level]: Master's
[Year Degree Conferred]: 2016
[Classification Number]: TP311.13
