An Automatic Three-Way Decision Clustering Method Based on k-means

Published: 2018-05-18 13:47

Topics: clustering + three-way decision; Source: Chongqing University of Posts and Telecommunications, 2016 master's thesis

[Abstract]: The k-means algorithm is simple, easy to understand, and efficient, and it has been widely used in cluster analysis in the more than 50 years since it was proposed. It has one notable shortcoming, however: the number of clusters must be set by hand. Because cluster analysis is an unsupervised method, the number of clusters is hard to fix in advance without prior knowledge. In addition, in practical applications an object can stand in one of several relations to a cluster: it definitely belongs to the cluster; it definitely does not belong to the cluster; or it may or may not belong, meaning that the information currently available is not enough to decide the relation. Such uncertainty is very common in fields such as social networks, biological information processing, and e-commerce. The result produced by k-means is in effect a two-way decision clustering result, in which only two relations exist between objects and clusters: an object either belongs to a cluster or does not. Traditional k-means therefore cannot effectively handle clustering tasks that involve this kind of uncertainty. This thesis studies clustering problems with such uncertainty and gives a solution, built on the k-means framework, that determines the number of clusters automatically.

1. To address the difficulty of determining the number of clusters automatically in k-means, the thesis proposes a new validity index for measuring clustering results. It defines a separation index that takes nearest neighbors into account and a new compactness index, proposes a clustering validity index based on difference ranking, and from these derives an automatic k-means clustering algorithm. The validity index considers two factors, the distribution of objects and their neighbors and the number of objects in each cluster, and measures clustering results well.

2. To address the limitations of traditional two-way clustering, the thesis introduces the idea of three-way decisions to extend the original clustering result. Objects in the same cluster play different roles in forming that cluster. Some are typical objects of the cluster and definitely belong to it; some are closely related to the cluster but are not typical of it, and may belong to it; some have little connection with the cluster and definitely do not belong to it. This is a typical three-way decision result: an object definitely belongs to a cluster, may belong to it, or definitely does not belong to it. By combining the three-way decision idea with the validity index defined above, the thesis proposes an automatic three-way decision clustering method based on k-means. On the one hand, the algorithm determines the number of clusters automatically; on the other hand, it further distinguishes the objects within each cluster, which gives a richer clustering result and makes further analysis of that result easier.

Experiments show that the proposed validity index outperforms the validity indices it was compared against. Compared with traditional two-way decision clustering algorithms, the proposed three-way decision clustering algorithm significantly improves clustering accuracy.
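The two ideas in the abstract, choosing the number of clusters by scoring candidate results with a validity index and splitting each cluster into "definitely belongs" and "may belong" parts, can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not give the formulas, so the sketch substitutes a centroid-based silhouette score for the proposed validity index and a rough-k-means-style distance-ratio rule for the three-way assignment. The function names, the farthest-first seeding, and the `ratio=1.3` threshold are all assumptions made for illustration.

```python
import numpy as np

def farthest_first_centers(X, k):
    """Deterministic seeding (an assumption): start at X[0], then keep
    adding the point farthest from the centers chosen so far."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2)
        centers.append(X[d.min(axis=1).argmax()])
    return np.array(centers, dtype=float)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations; returns the centers and the (n, k)
    matrix of point-to-center distances."""
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, np.linalg.norm(X[:, None] - centers[None], axis=2)

def validity(dist):
    """Stand-in index (NOT the thesis's index): centroid-based
    silhouette, mean of (b - a) / b with a = nearest-center distance
    and b = second-nearest-center distance. Higher is better."""
    srt = np.sort(dist, axis=1)
    a, b = srt[:, 0], srt[:, 1]
    return float(np.mean((b - a) / np.maximum(b, 1e-12)))

def auto_k(X, k_min=2, k_max=6):
    """Run k-means for each candidate k and keep the best-scoring k."""
    scores = {k: validity(kmeans(X, k)[1]) for k in range(k_min, k_max + 1)}
    return max(scores, key=scores.get), scores

def three_way_regions(dist, ratio=1.3):
    """Hypothetical three-way rule: a center is 'close' to an object if
    its distance is within `ratio` times the nearest-center distance.
    One close center -> core of that cluster (definitely belongs);
    several -> fringe of each of them (may belong).
    Returns one (core, fringe) pair of index sets per cluster."""
    nearest = dist.min(axis=1, keepdims=True)
    close = dist <= ratio * np.maximum(nearest, 1e-12)
    single = close.sum(axis=1) == 1
    return [(set(np.flatnonzero(close[:, j] & single).tolist()),
             set(np.flatnonzero(close[:, j] & ~single).tolist()))
            for j in range(dist.shape[1])]

# Demo on synthetic data: three tight blobs plus one ambiguous point
# placed halfway between two of them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in [(0, 0), (10, 0), (0, 10)]]
              + [np.array([[5.0, 0.0]])])
best_k, scores = auto_k(X)
_, dist = kmeans(X, best_k)
for j, (core, fringe) in enumerate(three_way_regions(dist)):
    print(f"cluster {j}: |core|={len(core)}, fringe={sorted(fringe)}")
```

Objects that appear in neither the core nor the fringe of a cluster are the ones that definitely do not belong to it, so the pair of sets is enough to recover all three relations. Raising `ratio` moves more borderline objects from cores into fringes.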
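[Degree-granting institution]: Chongqing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13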

[References]