Research on Improvements to the k-means Clustering Algorithm and Its Applications

Published: 2018-06-25 01:02

Topics: improved k-means algorithm + BWP index; Source: Lanzhou Jiaotong University, master's thesis, 2017

[Abstract]: Data mining is the process of extracting deep, valuable information from large volumes of disordered data. Its applications draw on several techniques, chiefly clustering, classification, association analysis, and prediction. Cluster analysis is an important branch of data mining: it partitions the objects of a data set into mutually disjoint subsets. It is now widely used in fields such as Web search, artificial intelligence, information retrieval, image pattern recognition, spatial database technology, and marketing. The well-known and widely used clustering methods are partitioning methods, hierarchical methods, density-based methods, grid-based methods, and probabilistic model-based methods [1]. k-means is a commonly used partitioning algorithm: its principle is simple, it is easy to understand and implement, and it can handle large data sets. Given a training data set and a number of clusters, the algorithm iteratively clusters the data according to a criterion function until the function value no longer changes or reaches an agreed threshold. Its main drawbacks are that the number of clusters must be specified in advance, that the result is sensitive to the chosen initial centers and to noise points in the data, and that the result may be only a local optimum.
This thesis addresses three of these drawbacks (the number of clusters must be given in advance, the choice of initial centers strongly affects the result, and the result is sensitive to outliers) and proposes an improved k-means clustering algorithm based on the max-min distance method. When applying the max-min distance method, the algorithm first uses a divide-and-conquer approach to split the theoretical interval of the parameter θ into smaller subintervals and picks one number from each subinterval as a θ value. The data set is clustered once for each θ value, and subintervals that yield poor clustering are discarded. The remaining subintervals are then discretized, following the idea of continuous-attribute discretization; θ takes each endpoint of the discretized subintervals in turn, and the data set is clustered for each value. Clustering results are evaluated by the mean of 95% of the ordered BWP index values: the larger the mean, the better the clustering, and the largest mean corresponds to the best result. The improved algorithm resolves the k-means algorithm's need for a predefined number of clusters and its sensitivity to the choice of initial centers and to outliers.
To verify the effectiveness of the improved algorithm, three data sets from the UCI repository are analyzed with several clustering algorithms; the results show that the improved algorithm achieves higher accuracy and better clustering. Finally, the thesis takes a data set of telecom users in Hangzhou, Zhejiang Province as its research object. On the one hand, the traditional k-means algorithm, the max-min distance based k-means algorithm, and the improved k-means algorithm are each applied to the data; the results show that the improved algorithm clusters better and produces more clearly separated clusters. The characteristics of each user group are then summarized, the groups are named, and differentiated marketing plans are drawn up to improve service quality in the industry. On the other hand, following the steps and methods of logistic modeling, a logistic classification model is trained on historical data to predict the churn rate of each user segment, so that the company can take retention measures ahead of time.
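The pipeline described in the abstract can be illustrated with a small sketch. The abstract does not give formulas, so everything below is an assumption rather than the author's implementation: `maxmin_centers` follows the classical max-min distance rule (a point becomes a new center when its distance to the nearest existing center exceeds θ times the distance between the first two centers), `bwp_scores` uses the BWP definition commonly given in the literature, (b - w) / (b + w), and `trimmed_mean` reads "95% of the ordered BWP values" as averaging the highest 95% of scores. Scoring singleton clusters as 0 is a convention borrowed from the silhouette index, not taken from the thesis. All function names are hypothetical, and the subinterval pruning and discretization steps are replaced by a plain list of candidate θ values.

```python
import math


def maxmin_centers(points, theta):
    """Choose initial centers with the classical max-min distance rule."""
    centers = [points[0]]  # assumption: the first point seeds the search
    second = max(points, key=lambda p: math.dist(p, centers[0]))
    base = math.dist(second, centers[0])  # distance between first two centers
    if base == 0:
        return centers  # all points coincide
    centers.append(second)
    while True:
        # Candidate: the point whose nearest center is farthest away.
        cand = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        gap = min(math.dist(cand, c) for c in centers)
        if gap <= theta * base:
            return centers
        centers.append(cand)


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def bwp_scores(points, labels):
    """Per-sample BWP = (b - w) / (b + w); assumed definition, see lead-in."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("BWP needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        # w: mean distance to the other members of the sample's own cluster.
        w = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - w) / (b + w) if b + w > 0 else 0.0)
    return scores


def trimmed_mean(scores, keep=0.95):
    """Mean of the highest `keep` fraction of scores (assumed reading of '95%')."""
    ordered = sorted(scores, reverse=True)
    n = max(1, int(len(ordered) * keep))
    return sum(ordered[:n]) / n


def best_theta(points, thetas, keep=0.95):
    """Cluster once per candidate theta and keep the highest-scoring result."""
    best = None
    for theta in thetas:
        labels, _ = kmeans(points, maxmin_centers(points, theta))
        if len(set(labels)) < 2:
            continue  # BWP is undefined for a single cluster
        score = trimmed_mean(bwp_scores(points, labels), keep)
        if best is None or score > best[0]:
            best = (score, theta, labels)
    return best
```

Calling `best_theta(points, thetas)` with a list of coordinate tuples and a handful of θ values in (0, 1) returns the best score, the θ that produced it, and the cluster labels, or `None` if no θ yields at least two clusters. The number of clusters is never passed in: it falls out of how many centers the max-min rule accepts for a given θ, which is the property the abstract highlights.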
[Degree-granting institution]: Lanzhou Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

[References]