Design and Implementation of GPU-Based Parallel Computation for the Dirichlet Algorithm

Published: 2018-10-05 18:11

【Abstract】: In recent years, the spread of information technology and rapid advances in hardware have created the preconditions for generating and storing big data. Businesses, research institutions, and government departments all hold large volumes of data. How to extract useful information from these large data sets has become a topic of growing concern, and data mining has drawn attention and developed rapidly against this background. Clustering, an important data mining tool, is the process of placing similar objects in the same group and dissimilar objects in different groups, and it is widely used across many fields. This thesis first introduces the basic theory of data mining and cluster analysis, with a focus on Dirichlet mixture model clustering. It then takes the Apache Mahout machine learning library as a basis for studying the Dirichlet process mixture model algorithm and its concrete implementation. This mixture model is a Bayesian mixture model with a Dirichlet process as its prior. Mahout provides both a single-machine implementation and a MapReduce implementation; this thesis mainly studies the latter. The Dirichlet process clustering algorithm is first run on several data sets, and analysis of the results shows that the algorithm's cost is concentrated in the map function. The thesis also studies the GPU (graphics processing unit) and proposes an improvement that uses GPU parallelism to raise the algorithm's efficiency. It examines GPU architecture and its advantages, as well as parallel programming with CUDA. Building on the source code of Mahout's Dirichlet process mixture model algorithm, it then implements the improvement by calling a native CUDA program through JNI, with the CUDA program processing the map function in parallel. Finally, the same data sets are used as input and the results are analyzed. Comparing the performance of the original and improved programs shows that the improved program raises the algorithm's efficiency, and that the gain is more pronounced for larger data volumes. These results provide a useful reference for research on the performance of data mining algorithms.
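To illustrate why the map function is a natural target for GPU parallelism, the following is a minimal Python sketch of a per-point assignment step in a mixture model sampler. It is not the Mahout source or the thesis's CUDA code: the function names (`gaussian_pdf`, `posterior`, `map_points`), the isotropic Gaussian cluster model, and the fixed `sigma` are all assumptions made for illustration. The point it shows is that each data point's posterior over clusters depends only on that point and the current cluster parameters, so the points can in principle be processed independently, for example by separate GPU threads.

```python
import math
import random


def gaussian_pdf(point, mean, sigma):
    """Density of an isotropic Gaussian (hypothetical cluster model)."""
    dim = len(point)
    sq_dist = sum((x - m) ** 2 for x, m in zip(point, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm


def posterior(point, weights, means, sigma):
    """Normalized probability of each cluster for a single point."""
    scores = [w * gaussian_pdf(point, m, sigma) for w, m in zip(weights, means)]
    total = sum(scores)
    if total == 0.0:
        # All densities underflowed; fall back to a uniform distribution.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def map_points(points, weights, means, sigma, seed=0):
    """Sample a cluster index for every point and emit (cluster, point) pairs.

    Each iteration reads only its own point plus the shared cluster
    parameters, which is what makes the step data-parallel.
    """
    rng = random.Random(seed)
    pairs = []
    for point in points:
        probs = posterior(point, weights, means, sigma)
        cluster = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        pairs.append((cluster, point))
    return pairs
```

In this sketch the loop body in `map_points` is the part that would move to a device kernel, with the cluster weights and means copied to the device once per iteration; how the thesis actually partitions the work is not stated in the abstract.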
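【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year degree conferred】: 2013
【Classification number】: TP311.13;TP338.6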

【References】