Research on Clustering Algorithms in Big Data Analysis

Published: 2018-06-24 17:03

Topics: cluster analysis + Hadoop; source: master's thesis, Anhui University of Science and Technology, 2016

【Abstract】: With the development of information technology, and of mobile communication technology in particular, social networks, the Internet of Things, and cloud computing have successively entered people's daily work and life. Large volumes of data have accumulated, and the data continue to grow rapidly. Faced with massive data, how to extract valuable information from it has become a widely studied problem in many fields. Cluster analysis is a common technique in data mining and machine learning and is widely used in both academia and industry. However, traditional clustering algorithms process data serially; when applied to massive data sets they are inefficient, owing to memory limits and other constraints, and cannot meet current demands for large-scale data processing. To address the challenge of massive data and improve the efficiency of clustering algorithms, parallel clustering has become a focus of current research. Hadoop is a widely used data analysis platform; it is an open-source implementation of the MapReduce computing model and of the distributed storage system GFS (Google File System). Owing to its ease of use and good scalability, Hadoop has become one of the core technologies of big data analysis. Spark is a very popular distributed computing platform; it implements a memory-based distributed data structure and provides a simple yet powerful programming interface that can be used to build clustering algorithms for big data analysis. This thesis compares the two big data processing platforms above, analyzes their parallelization principles in detail, and discusses how to parallelize clustering algorithms in order to process massive data. It examines typical clustering algorithms used in big data analysis, analyzing their characteristics and application scenarios, and proposes a k-means clustering algorithm for large data sets based on prediction strength, together with its implementation on both platforms.
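The abstract does not spell out how a clustering algorithm is parallelized, so the following is only a minimal sketch of the usual map/reduce formulation of one k-means iteration, not the thesis's implementation and not its prediction-strength variant. The function names (`assign_point`, `combine`, `kmeans_iteration`, `kmeans`) are hypothetical, and the code runs in a single process; it assumes points are tuples of numbers and that initial centroids are supplied by the caller.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_point(point, centroids):
    """Map step: emit (index of nearest centroid, (coordinate sum, count))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def combine(a, b):
    """Reduce step: add two partial (coordinate sum, count) pairs.

    The operation is associative and commutative, which is what allows
    partial results from different workers to be merged in any order.
    """
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans_iteration(points, centroids):
    """One k-means iteration written as map followed by reduce-by-key."""
    partial = {}
    for p in points:
        key, value = assign_point(p, centroids)
        partial[key] = combine(partial[key], value) if key in partial else value
    # Assumption: a cluster that received no points keeps its old centroid.
    updated = list(centroids)
    for key, (total, n) in partial.items():
        updated[key] = tuple(x / n for x in total)
    return updated


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until no centroid moves by more than tol."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

On a cluster, the per-point assignment would typically run as the map phase and `combine` as the combiner or reducer; in Spark the same two functions correspond roughly to an RDD `map` followed by `reduceByKey`, with the centroids shared across workers on each iteration.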
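【Degree-granting institution】: Anhui University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2016
【Classification number】: TP311.13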

【References】