Research and Application of a K-means Algorithm Based on Density and Distance

Published: 2018-05-12 23:19

Topic: data mining + cluster analysis; Source: Xi'an University of Technology, master's thesis, 2016

[Abstract]: Data mining is the computational process of exploring large data sets to uncover the patterns hidden in them. It draws on many techniques and is an important branch of computer science. Cluster analysis is one of its key techniques: it partitions unlabeled data samples into clusters according to their similarity. This thesis studies K-means, the most basic clustering algorithm in data mining. K-means is simple to implement and easy to use, but it has several drawbacks: the number of clusters K must be specified by the user, the initial cluster centers are chosen at random, and the algorithm can only find roughly spherical clusters. The work falls into three parts. First, on the theoretical side, the isolated points (outliers) that distort the clustering result are removed and the selection of initial cluster centers is improved; once the initial centers are fixed, the data points are then assigned to the clusters in a reasonable way. Second, so that massive data sets can be processed, the improved algorithm is implemented on the Spark platform. Third, the improved algorithm is applied to mobile customer segmentation. Experiments show that the improved K-means algorithm produces more accurate clusters than the traditional K-means algorithm, and that the parallel Spark implementation reduces execution time without sacrificing accuracy. By selecting suitable segmentation variables, the collected mobile customer data are divided into categories by similarity, which helps analysts of mobile customer data adopt different marketing strategies for different customer groups.
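The abstract does not give the thesis's formulas, so the following is only a minimal sketch of what a density-and-distance variant of K-means might look like, not the author's method. It assumes that density means the number of neighbours within a radius `eps`, that a point is isolated if it has fewer than `min_neighbors` such neighbours, and that each new initial center is the point maximizing density times distance to the nearest center already chosen. The function names and all three parameters are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_density(points, eps):
    """For each point, count the other points within radius eps (assumed density)."""
    return [sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= eps)
            for i, p in enumerate(points)]


def drop_isolated(points, eps, min_neighbors):
    """Keep only points with at least min_neighbors neighbours within eps."""
    dens = local_density(points, eps)
    return [p for p, d in zip(points, dens) if d >= min_neighbors]


def pick_centers(points, k, eps):
    """Assumed rule: densest point first, then the unchosen point with the
    largest density * (distance to nearest chosen center). Requires k <= len(points)."""
    dens = local_density(points, eps)
    chosen = [max(range(len(points)), key=lambda i: dens[i])]
    while len(chosen) < k:
        candidates = [i for i in range(len(points)) if i not in chosen]
        chosen.append(max(
            candidates,
            key=lambda i: dens[i] * min(dist(points[i], points[c]) for c in chosen),
        ))
    return [points[i] for i in chosen]


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments have stabilized
            break
        centers = new_centers
    return centers, clusters


# Toy data: two tight groups plus one far-away isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
kept = drop_isolated(data, eps=1.5, min_neighbors=1)
centers, clusters = kmeans(kept, pick_centers(kept, k=2, eps=1.5))
print(centers)
```

On this toy data the isolated point `(50, 50)` is dropped before clustering, so it cannot pull a center away from the two groups, and the deterministic seeding places one initial center in each group. The pairwise density computation is quadratic in the number of points; a distributed implementation such as the Spark one mentioned in the abstract would need a different strategy, which is not sketched here.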
[Degree-granting institution]: Xi'an University of Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13
