
Research on an Improved Affinity Propagation Clustering Algorithm and Its Applications

Published: 2018-06-12 17:18

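  Topics: affinity propagation clustering + weighted Mahalanobis distance; Source: Nanjing University of Science and Technology, master's thesis, 2017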


[Abstract]: Cluster analysis is an important part of multivariate statistical analysis and is widely applied in many fields. Affinity propagation (AP) clustering is an unsupervised clustering algorithm proposed by Frey and Dueck in 2007. It requires neither initial cluster centers nor the number of clusters: once a similarity matrix has been constructed and the preference parameter set, a message-passing mechanism automatically determines suitable exemplars (class representative points). Preliminary studies show that the algorithm has many desirable properties, such as fast computation, a small sum of squared errors, and high clustering accuracy, but it also has shortcomings. First, the AP algorithm uses the negative Euclidean distance as its similarity measure, but the Euclidean distance is suitable only when samples are mutually independent, is sensitive to the units in which attributes are measured, and treats every attribute as equally important to the distance. This thesis proposes a weighted Mahalanobis distance based on the mean square deviation and uses its negative as the similarity measure of the AP algorithm. The Mahalanobis distance adapts to the geometric distribution of the data and removes the interference of correlations between attributes, while weighting the attributes by mean square deviation also accounts for the effect of each attribute's relative importance on the final clustering. This similarity measure both broadens the algorithm's range of application and makes the clustering results more accurate. Second, the AP algorithm sets the preference parameter P of every point to the same value, which by default gives all sample points an equal chance of becoming an exemplar and ignores the influence of the data distribution on whether a given point can become one.
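As a rough illustration of the similarity measure described above, the following NumPy sketch builds a similarity matrix from the negative of a weighted Mahalanobis distance. The abstract does not give the thesis's formulas, so several choices here are assumptions: "mean square deviation" is read as the per-attribute standard deviation after min-max scaling, the weights are normalized to sum to 1, and the weighted distance takes the form sqrt((x - y)^T W^{1/2} C^{-1} W^{1/2} (x - y)) with C the sample covariance. The function names `std_weights` and `weighted_mahalanobis_similarity` are hypothetical.

```python
import numpy as np

def std_weights(X):
    """Attribute weights proportional to each column's standard deviation
    after min-max scaling (an ASSUMED reading of 'mean square deviation')."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0                 # avoid dividing by zero on constant columns
    sigma = ((X - lo) / span).std(axis=0)
    return sigma / sigma.sum()            # weights sum to 1

def weighted_mahalanobis_similarity(X):
    """Similarity matrix S[i, j] = -d(x_i, x_j), where d is an ASSUMED
    weighted Mahalanobis distance using the weights from std_weights."""
    w = std_weights(X)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov_inv = np.linalg.pinv(cov)         # pseudo-inverse tolerates singular covariance
    root_w = np.sqrt(w)
    M = root_w[:, None] * cov_inv * root_w[None, :]   # W^{1/2} C^{-1} W^{1/2}
    diff = X[:, None, :] - X[None, :, :]              # pairwise differences, shape (n, n, d)
    d2 = np.einsum('ijk,kl,ijl->ij', diff, M, diff)   # squared distances
    return -np.sqrt(np.maximum(d2, 0.0))              # clip tiny negative rounding errors
```

The resulting matrix is symmetric and non-positive with zeros on the diagonal; in AP the diagonal is then overwritten with the preference values.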
To address this shortcoming, this thesis sets P under the assumption that the larger the sum of the membership degrees of all other points to a given point, the more likely that point is to become an exemplar, so that different points receive different P values. Setting P from the characteristics of the data, that is, assigning higher P values in advance to points that are more likely to become exemplars, reduces the number of iterations and the running time of the algorithm. In addition, based on the Cauchy convergence criterion, the convergence of the availability matrix and the responsibility matrix in the model is analyzed empirically. Finally, to obtain the k clusterings with 1 through k clusters, an adaptive step size method is proposed that dynamically adjusts P; on this basis, the relationship between P and the number of clusters is studied and the model is further optimized. The optimal number of clusters is estimated with the Gap index. Simulation experiments on several data sets from the UCI repository show that the model is feasible and superior.
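The message-passing step with per-point P values can be sketched as follows. This is the standard damped responsibility/availability update from Frey and Dueck, not the thesis's modified model. Two parts are assumptions made only for illustration: the stopping rule (halt when the largest change between successive message matrices falls below `tol`, a Cauchy-style test) and the helper `stand_in_preferences`, which uses the mean column similarity as a stand-in for the membership-degree sum that the abstract mentions but does not define.

```python
import numpy as np

def affinity_propagation(S, preference, damping=0.5, max_iter=200, tol=1e-6):
    """Standard AP updates. `preference` may be a scalar or a length-n vector,
    so each point can carry its own P value."""
    S = np.array(S, dtype=float)          # work on a copy
    n = S.shape[0]
    rows = np.arange(n)
    S[rows, rows] = preference            # preferences go on the diagonal
    R = np.zeros((n, n))                  # responsibilities
    A = np.zeros((n, n))                  # availabilities
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R_new = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R_new, 0)
        Rp[rows, rows] = R_new[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A_new = damping * A + (1 - damping) * A_new
        # ASSUMED Cauchy-style stopping rule on successive iterates
        change = max(np.abs(R_new - R).max(), np.abs(A_new - A).max())
        R, A = R_new, A_new
        if change < tol:
            break
    exemplars = np.flatnonzero((R + A)[rows, rows] > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)  # no exemplar emerged
    labels = exemplars[S[:, exemplars].argmax(axis=1)]
    labels[exemplars] = exemplars         # each exemplar represents itself
    return exemplars, labels

def stand_in_preferences(S):
    """HYPOTHETICAL per-point P: the off-diagonal median shifted by how far a
    point's mean similarity to the others lies above or below average."""
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)
    col_mean = np.where(off, S, 0.0).sum(axis=0) / (n - 1)
    return np.median(S[off]) + (col_mean - col_mean.mean())
```

Passing a scalar such as the median off-diagonal similarity reproduces the uniform-P setting that the abstract criticizes, while passing a vector lets centrally located points start with a higher P.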
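[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13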





Link to this article: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2010441.html

