基于MapReduce的分布式快速聚类算法研究
本文关键词: 海量数据 聚类 MapReduce 并行计算 数据挖掘 出处:《东北电力大学》2017年硕士论文 论文类型:学位论文
【摘要】:随着信息技术的高速发展,数据规模呈现指数级增长态势,传统聚类算法面临巨大的挑战。一是海量数据内的噪声杂、冗余度高、价值密度低,聚类算法的准确率不高;二是串行聚类算法面对海量数据时,搜索邻域代价巨大,执行效率无法适应实际需求。针对上述问题,本文充分分析数据特点,基于MapReduce大数据处理框架,设计了分布式快速聚类算法,实现了高效、高精度的并行数据聚类。针对海量数据中冗余度高,无价值数据繁多的问题,本文提出一种基于MapReduce的分布式数据约减算法。通过一种新的抽样算法计算数据点的矩形域和抽样域,并在抽样域中确定样本数据,然后对样本数据进行扩展抽样来达到约减原始数据集的目的,最后提出一种代表性验证策略来检验样本集,从而解决海量数据聚类产生巨大I/O开销和网络开销的问题。针对搜索最近邻代价消耗大,聚类执行效率低的问题,本文利用Map任务对样本数据集进行相等大小的数据划分,Reduce任务对数据子集进行局部密度聚类,因此针对单节点提出基于扩展区域查询的密度聚类算法。首先通过基于固定网格的扩展区域查询方法,确定数据点最近邻和反最近邻的邻域关系,建立每个数据点的影响空间域,然后提出异常点判定函数,使算法能够准确地识别噪声点和边界点。Reduce聚类任务结束后输出局部聚类结果,为得到面向整个数据集的全局聚类结果,本文提出一种基于簇间距离的局部类簇合并算法,通过簇间距离的计算确定局部类簇间的分布关系,得到可以两两合并的局部类簇对,然后根据连通子图发现方法合并局部类簇对,最后输出全局聚类结果。实验结果表明,本文提出的算法有效地将海量数据进行约减,保证了样本数据与原始数据分布的一致性,在信息量无损失的前提下降低了数据冗余,并且该算法能够快速处理任意形状的类簇,大幅度提高了算法的执行效率和聚类质量。
[Abstract]:With the rapid development of information technology, the scale of data is increasing exponentially. The traditional clustering algorithm is faced with great challenges. One is the noise clutter, high redundancy and low value density in massive data. The accuracy of clustering algorithm is not high; Second, the serial clustering algorithm in the face of massive data, the search cost is huge, the execution efficiency can not meet the actual needs. In view of the above problems, this paper fully analyzes the characteristics of the data. Based on the MapReduce big data processing framework, a distributed fast clustering algorithm is designed to achieve efficient and accurate parallel data clustering. In this paper, a distributed data reduction algorithm based on MapReduce is proposed, and a new sampling algorithm is proposed to calculate the rectangular and sampling regions of data points. The sample data is determined in the sampling domain, and then the sample data is extended to reduce the original data set. Finally, a representative validation strategy is proposed to test the sample set. In order to solve the problem of huge I / O overhead and network overhead caused by massive data clustering, aiming at the problem of high cost of searching nearest neighbor and low efficiency of clustering execution. In this paper, the Map task is used to partition the sample data set with equal size and reduce task to cluster the local density of the data subset. Therefore, a density clustering algorithm based on extended region query is proposed for single node. Firstly, the neighborhood relationship between nearest neighbor and anti-nearest neighbor of data point is determined by the extended region query method based on fixed grid. The influence spatial domain of each data point is established, and the outlier decision function is proposed, which enables the algorithm to accurately identify the noise point and the boundary point. In order to obtain the global clustering results for the whole data set, this paper proposes a local cluster merging algorithm based on the distance between clusters, which determines the distribution of local clusters by calculating the distance between clusters. The local cluster pairs which can be merged by pairwise are obtained, and the local cluster pairs are merged according to the connected subgraph discovery method. Finally, the global clustering results are outputted. The experimental results show that. The algorithm proposed in this paper can effectively reduce the mass data, ensure the consistency of the distribution between the sample data and the original data, and reduce the data redundancy without loss of information. Moreover, the algorithm can deal with arbitrary clusters quickly, which greatly improves the efficiency and clustering quality of the algorithm.
【学位授予单位】:东北电力大学
【学位级别】:硕士
【学位授予年份】:2017
【分类号】:TP311.13
【参考文献】
相关期刊论文 前10条
1 韩岩;李晓;;加速大数据聚类K-means算法的改进[J];计算机工程与设计;2015年05期
2 于彦伟;王欢;王沁;赵金东;;面向海量数据流的基于密度的簇结构挖掘算法[J];软件学报;2015年05期
3 程学旗;靳小龙;王元卓;郭嘉丰;张铁赢;李国杰;;大数据系统和分析技术综述[J];软件学报;2014年09期
4 陆颖华;马廷淮;钟水明;曹杰;王新;Abdullah Al-Dhelaane;;Improved locality-sensitive hashing method for the approximate nearest neighbor problem[J];Chinese Physics B;2014年08期
5 崔颖安;李雪;王志晓;张德运;;在线社交媒体数据抽样方法的比较研究[J];计算机学报;2014年08期
6 顾荣;严金双;杨晓亮;袁春风;黄宜华;;Hadoop MapReduce短作业执行性能优化[J];计算机研究与发展;2014年06期
7 崔颖安;李雪;王志晓;张德运;;社会化媒体大数据多阶段整群抽样方法[J];软件学报;2014年04期
8 张引;陈敏;廖小飞;;大数据应用的现状与展望[J];计算机研究与发展;2013年S2期
9 刘淑芬;孟冬雪;王晓燕;;基于网格单元的DBSCAN算法[J];吉林大学学报(工学版);2014年04期
10 刘义;景宁;陈荦;熊伟;;MapReduce框架下基于R-树的k-近邻连接算法[J];软件学报;2013年08期
,本文编号:1457393
本文链接:https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1457393.html