Research on an Improved K-means Clustering Algorithm

Published: 2018-03-07 18:21

Topic: cluster analysis　Focus: K-means algorithm　Source: Anhui University, 2015 master's thesis　Document type: degree thesis

[Abstract]: Rapid advances in information technology and the rise of Web technology are pushing data acquisition and storage toward automation, speed, and intelligence. Data mining emerged in response to massive, irregular data resources, and cluster analysis is one of its important research branches. Cluster analysis is an unsupervised, exploratory classification technique: without any prior knowledge, it partitions an unlabeled data set according to the similarity between data objects, producing a set of distinct clusters. It is now applied in many fields, including statistics, e-commerce, Web analysis, biomedicine, and marketing analysis. K-means is a classic partition-based clustering algorithm. It selects initial cluster centers, assigns the data to clusters, and then adjusts each center to the mean of the cluster it generated. Through repeated iteration it aims to maximize similarity within clusters and minimize similarity between clusters. K-means is simple in principle and easy to implement, and it scales well with favorable time complexity on large data sets. It nevertheless has several shortcomings: it is sensitive to the choice of initial cluster centers, and a poor choice can cause large errors in the result; the final result is often a local optimum rather than the global optimum; and the number of clusters k must be given in advance. Building on adaptive feature weighting and genetic algorithms, this thesis addresses some of these shortcomings of traditional K-means, keeping the clustering result from becoming trapped in a local optimum and improving the accuracy and stability of the algorithm. To address the inflexibility of fixed feature weights and the heavy dependence on the initial centers, attribute weights are adjusted on the principle that the more important an attribute is, the larger its weight, so that the relative importance of the attributes can be seen clearly. Without k being specified, the algorithm uses the density of the data objects to select several representative objects from high-density regions as initial cluster centers, and determines the optimal k by comparing values of the criterion function. During iteration it updates the feature weights according to the criterion that objects within a cluster be as similar as possible and clusters be as different as possible. The genetic algorithm is then combined with adaptive weighting and applied to K-means: on the basis of the attribute weights, the global search capability of the genetic algorithm is used to obtain good cluster centers, and K-means is finally used to refine them. This method effectively reduces the dependence of K-means on the initial centers and improves clustering quality. After testing on experimental data sets, the algorithm is applied to image segmentation, one of the application areas of clustering, and its segmentation results are compared. The experiments verify the two improved algorithms on standard data sets, analyze them in terms of accuracy, number of iterations, and cluster centers, and compare them with traditional K-means, confirming the efficiency of the improved K-means clustering algorithms.
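The two ideas the abstract combines before the genetic-algorithm step, seeding from dense regions and measuring distance with per-feature weights, can be illustrated with a minimal pure-Python sketch. This is not the thesis's method: the abstract gives no formulas, so the density rule (neighbor count within a fixed `radius`), the function names `density_seeds` and `kmeans`, and the fixed weight vector `w` are all assumptions made for illustration. The adaptive weight update and the genetic search are not shown.

```python
def weighted_sq_dist(x, c, w):
    """Squared Euclidean distance with one weight per feature."""
    return sum(wj * (xj - cj) ** 2 for xj, cj, wj in zip(x, c, w))


def density_seeds(points, k, radius, w):
    """Pick up to k seeds from dense regions (assumed rule, not the thesis's).

    The density of a point is the number of points within `radius` of it.
    Candidates are visited in order of decreasing density, and a candidate
    is skipped if it lies within `radius` of an already chosen seed.
    """
    r2 = radius ** 2
    density = [sum(weighted_sq_dist(p, q, w) <= r2 for q in points)
               for p in points]
    order = sorted(range(len(points)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(weighted_sq_dist(points[i], s, w) > r2 for s in seeds):
            seeds.append(points[i])
        if len(seeds) == k:
            break
    return seeds


def kmeans(points, centers, w, max_iter=100):
    """Lloyd's iteration: assign each point to its nearest center,
    then move each center to the mean of its members."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda j: weighted_sq_dist(p, centers[j], w))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
w = [1.0, 1.0]  # equal weights; an adaptive scheme would update these
seeds = density_seeds(points, k=2, radius=1.0, w=w)
centers, labels = kmeans(points, seeds, w)
```

Because the seeds come from separate dense regions, the iteration starts with one center per group instead of depending on a random draw. If `radius` is too large, `density_seeds` may return fewer than `k` seeds, which is one reason a real implementation would need a more careful density criterion.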
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP311.13
