Research on Clustering Mining Algorithms for Big Data

Published: 2018-05-06 23:37

Topic: big data + clustering mining; Source: Nanjing University of Posts and Telecommunications, 2015 master's thesis

【Abstract】: The enormous potential value of big data has driven the emergence of big data mining technology. Big data mining is the process of extracting valuable knowledge from data sources characterized by large volume, high velocity, and variety; how to do so accurately and quickly is a current research focus. This thesis focuses on clustering mining algorithms for big data, with the goal of improving both their accuracy and their efficiency. It first improves a traditional clustering algorithm to raise accuracy, and then parallelizes the improved algorithm to raise efficiency. To improve clustering accuracy, the thesis builds on the DBSCAN and k-means algorithms and proposes a density-based incremental k-means clustering algorithm (Density-based Incremental k-means, DBIK-means). DBIK-means first computes the density of each data point and forms basic clusters, each consisting of a center point whose density is not less than a given threshold together with the points within its density range. It then merges basic clusters according to the distance between their center points. Finally, it assigns every point not yet assigned to any cluster to the nearest cluster. Theoretical analysis and experiments on the KDD CUP 99 data set show that the algorithm can discover clusters of arbitrary shape, is insensitive to the input order of data points and to its parameters, and achieves higher clustering accuracy with only a slight increase in running time, so its overall performance is better than that of k-means. To improve the efficiency of DBIK-means and reduce its time complexity, the thesis uses a distributed database to simulate shared storage and parallelizes DBIK-means on the Hadoop cloud computing platform. Simulation experiments show that DBIK-means is suitable for clustering large-scale data sets. Finally, the thesis applies DBIK-means to the classification of telecom customers; the results show that the algorithm can automatically divide a large number of telecom customers into several clusters with fairly good accuracy, helping telecom operators formulate different marketing strategies for different types of customers.
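The three steps named in the abstract (density-based basic clusters, merging by center distance, assigning leftover points) can be illustrated with a small Python sketch. This is not the thesis's implementation: the abstract gives no formulas, so the density definition (number of points within a radius `eps`), the parameter names `eps`, `min_pts`, and `merge_dist`, the choice to visit points in descending density order, and the use of cluster centroids in the last step are all assumptions made for illustration.

```python
import math


def dbik_means_sketch(points, eps, min_pts, merge_dist):
    """Illustrative three-step clustering; all parameters are hypothetical."""
    n = len(points)

    # Step 1: density = number of points within eps (assumed definition).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    density = [len(nb) for nb in neighbors]

    labels = [-1] * n   # -1 means "not yet assigned"
    centers = []        # center point of each basic cluster
    # Visit denser points first so they become centers (assumed ordering).
    for i in sorted(range(n), key=lambda idx: -density[idx]):
        if labels[i] != -1 or density[i] < min_pts:
            continue
        for j in neighbors[i]:
            if labels[j] == -1:
                labels[j] = len(centers)
        centers.append(points[i])

    # Step 2: merge basic clusters whose centers lie within merge_dist,
    # using a small union-find structure.
    parent = list(range(len(centers)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(len(centers)):
        for b in range(a + 1, len(centers)):
            if math.dist(centers[a], centers[b]) <= merge_dist:
                parent[find(a)] = find(b)

    roots = sorted({find(a) for a in range(len(centers))})
    remap = {root: k for k, root in enumerate(roots)}
    labels = [remap[find(l)] if l != -1 else -1 for l in labels]

    # Step 3: assign each leftover point to the cluster with the nearest
    # centroid (assumed meaning of "nearest cluster").
    k = len(roots)
    if k == 0:
        return labels
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, l in zip(points, labels):
        if l != -1:
            counts[l] += 1
            for d in range(dim):
                sums[l][d] += p[d]
    centroids = [[s / c for s in row] for row, c in zip(sums, counts)]
    for i, l in enumerate(labels):
        if l == -1:
            labels[i] = min(
                range(k), key=lambda c: math.dist(points[i], centroids[c])
            )
    return labels


# Two small blobs plus one isolated point that joins the nearer blob.
demo = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (3, 3)]
print(dbik_means_sketch(demo, eps=1.5, min_pts=3, merge_dist=2.0))
# prints [0, 0, 0, 0, 1, 1, 1, 1, 0]
```

The pairwise distance computation here is quadratic in the number of points, which is the kind of cost that would motivate the parallelization the abstract describes; the sketch makes no attempt to reproduce that Hadoop-based design.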
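【Degree-granting institution】: Nanjing University of Posts and Telecommunications
【Degree level】: Master's
【Year degree granted】: 2015
【Classification number】: TP311.13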

【References】