基于MapReduce的分布式快速聚类算法研究

发布时间：2018-01-23 11:05

本文关键词： 海量数据聚类 MapReduce 并行计算数据挖掘　出处：《东北电力大学》2017年硕士论文　论文类型：学位论文

【摘要】：随着信息技术的高速发展,数据规模呈现指数级增长态势,传统聚类算法面临巨大的挑战。一是海量数据内的噪声杂、冗余度高、价值密度低,聚类算法的准确率不高;二是串行聚类算法面对海量数据时,搜索邻域代价巨大,执行效率无法适应实际需求。针对上述问题,本文充分分析数据特点,基于MapReduce大数据处理框架,设计了分布式快速聚类算法,实现了高效、高精度的并行数据聚类。针对海量数据中冗余度高,无价值数据繁多的问题,本文提出一种基于MapReduce的分布式数据约减算法。通过一种新的抽样算法计算数据点的矩形域和抽样域,并在抽样域中确定样本数据,然后对样本数据进行扩展抽样来达到约减原始数据集的目的,最后提出一种代表性验证策略来检验样本集,从而解决海量数据聚类产生巨大I/O开销和网络开销的问题。针对搜索最近邻代价消耗大,聚类执行效率低的问题,本文利用Map任务对样本数据集进行相等大小的数据划分,Reduce任务对数据子集进行局部密度聚类,因此针对单节点提出基于扩展区域查询的密度聚类算法。首先通过基于固定网格的扩展区域查询方法,确定数据点最近邻和反最近邻的邻域关系,建立每个数据点的影响空间域,然后提出异常点判定函数,使算法能够准确地识别噪声点和边界点。Reduce聚类任务结束后输出局部聚类结果,为得到面向整个数据集的全局聚类结果,本文提出一种基于簇间距离的局部类簇合并算法,通过簇间距离的计算确定局部类簇间的分布关系,得到可以两两合并的局部类簇对,然后根据连通子图发现方法合并局部类簇对,最后输出全局聚类结果。实验结果表明,本文提出的算法有效地将海量数据进行约减,保证了样本数据与原始数据分布的一致性,在信息量无损失的前提下降低了数据冗余,并且该算法能够快速处理任意形状的类簇,大幅度提高了算法的执行效率和聚类质量。
[Abstract]:With the rapid development of information technology, the scale of data is increasing exponentially. The traditional clustering algorithm is faced with great challenges. One is the noise clutter, high redundancy and low value density in massive data. The accuracy of clustering algorithm is not high; Second, the serial clustering algorithm in the face of massive data, the search cost is huge, the execution efficiency can not meet the actual needs. In view of the above problems, this paper fully analyzes the characteristics of the data. Based on the MapReduce big data processing framework, a distributed fast clustering algorithm is designed to achieve efficient and accurate parallel data clustering. In this paper, a distributed data reduction algorithm based on MapReduce is proposed, and a new sampling algorithm is proposed to calculate the rectangular and sampling regions of data points. The sample data is determined in the sampling domain, and then the sample data is extended to reduce the original data set. Finally, a representative validation strategy is proposed to test the sample set. In order to solve the problem of huge I / O overhead and network overhead caused by massive data clustering, aiming at the problem of high cost of searching nearest neighbor and low efficiency of clustering execution. In this paper, the Map task is used to partition the sample data set with equal size and reduce task to cluster the local density of the data subset. Therefore, a density clustering algorithm based on extended region query is proposed for single node. Firstly, the neighborhood relationship between nearest neighbor and anti-nearest neighbor of data point is determined by the extended region query method based on fixed grid. The influence spatial domain of each data point is established, and the outlier decision function is proposed, which enables the algorithm to accurately identify the noise point and the boundary point. In order to obtain the global clustering results for the whole data set, this paper proposes a local cluster merging algorithm based on the distance between clusters, which determines the distribution of local clusters by calculating the distance between clusters. The local cluster pairs which can be merged by pairwise are obtained, and the local cluster pairs are merged according to the connected subgraph discovery method. Finally, the global clustering results are outputted. The experimental results show that. The algorithm proposed in this paper can effectively reduce the mass data, ensure the consistency of the distribution between the sample data and the original data, and reduce the data redundancy without loss of information. Moreover, the algorithm can deal with arbitrary clusters quickly, which greatly improves the efficiency and clustering quality of the algorithm.
【学位授予单位】：东北电力大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP311.13

【参考文献】