Research on Clustering Algorithms for Complex Data

Published: 2018-04-04 18:23

  Topic: clustering. Entry point: environmental pollution sampling data. Source: Lanzhou University, doctoral dissertation, 2016


[Abstract]: In recent years, data volumes have grown sharply and the variety of data sources has kept increasing, making it ever harder to extract useful information from complex data. To make good use of such data, one must understand it and mine its intrinsic patterns. Cluster analysis identifies the intrinsic patterns in a dataset from the similarity between data items. However, many clustering algorithms suffer from low accuracy or low efficiency when partitioning different types of datasets, so more effort is needed to improve their performance. Aiming to cluster complex data more efficiently and more accurately, this dissertation addresses three types of data, focuses on three aspects of clustering research, and proposes four clustering algorithms: EPC, MulSim, CLUB, and SUM. EPC clusters sampled air-pollution data according to pollution characteristics; it can improve the accuracy of source apportionment models such as CMB, PMF, UNMIX, and PCA, and compared with traditional algorithms it is easier to use and better suited to high-dimensional data. MulSim and CLUB both find clusters of arbitrary shape, arbitrary density, and arbitrary size: MulSim clusters using a strategy of similarity between a single point and multiple points, while CLUB clusters by identifying the density backbones of clusters. SUM clusters the vertices of graph data; its basic principle is to question the role that the maximum-degree vertex of a cluster plays in connecting the other vertices during clustering.

(1) EPC. After preprocessing the data in its first step, EPC iterates its second step, in which each iteration selects the first unlabeled data point as a cluster center. Then, using the similarity function proposed in the dissertation and a user-specified similarity threshold, it assigns each data point to the cluster of its most similar center. Finally, it updates the clusters with a method similar to k-Means to form the final cluster structure. Experiments on synthetic and real datasets verify the effectiveness of EPC. The results show that EPC not only clusters environmental pollution sampling data by the similarity of their pollution characteristics but also detects the outliers among them at the same time.

(2) MulSim. MulSim defines a similarity function that adapts automatically to changes in data point density: if a data point is similar both to another data point and to that point's neighbors, the two points are considered to belong to the same cluster. Experimental results show that, across the tested dataset types (arbitrary density, uniform density, clusters containing multiple centers, spiral clusters, spherical clusters, arbitrarily shaped clusters, and multidimensional data), MulSim's clustering quality is better than that of six comparison algorithms in most cases.

(3) CLUB. CLUB first finds initial clusters using the mutual k-nearest-neighbor method. It then takes the initial clusters as the input of its second step and identifies the density backbones of the clusters using the k-nearest-neighbor method. Next, it forms the final cluster structure by assigning each unlabeled data point to the cluster of its nearest neighbor of higher density. Finally, it detects outliers from within the clusters. CLUB is evaluated on nine two-dimensional datasets containing clusters of arbitrary shape, density, and size and on seven widely used multidimensional datasets, in comparison with three classical algorithms and three recent algorithms. CLUB is also applied to the Olivetti Face dataset to demonstrate its effectiveness in face recognition. The experimental results show that CLUB outperforms the comparison algorithms in most cases.

(4) SUM. SUM defines a similarity function from the number of common neighbors of two adjacent vertices and the degree of the lower-degree vertex. After placing similar vertices in the same cluster, SUM questions the role of each cluster's maximum-degree vertex in connecting the other vertices: it disconnects the maximum-degree vertex from its neighbors and reassigns that vertex to obtain the initial clusters. SUM then assigns the still-unlabeled vertices to the initial clusters and adjusts the boundary vertices to form the final clusters. Experiments against four classical and two recent graph clustering algorithms, on four graphs with ground-truth cluster structure and four graphs without it, show that SUM detects cluster structure fairly accurately and that its results are better than those of the comparison algorithms.

The time complexity of all four algorithms is close to linear. Each of the four algorithms can therefore cluster datasets with the corresponding characteristics efficiently and with relatively high accuracy.
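The abstract describes EPC only in outline, so the following Python sketch is one possible reading of that outline and not the dissertation's method. It assumes cosine similarity as a stand-in for the similarity function proposed in the dissertation (which the abstract does not give), skips preprocessing and outlier detection, and uses hypothetical names (`cosine_similarity`, `threshold_clustering`, `max_updates`).

```python
import math

def cosine_similarity(a, b):
    # Assumption: a stand-in for the dissertation's own similarity function.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def threshold_clustering(points, threshold, similarity=cosine_similarity, max_updates=10):
    labels = [-1] * len(points)
    centers = []
    # Phase 1: the first unlabeled point becomes a new center; every later
    # unlabeled point that is at least `threshold` similar to it joins its cluster.
    for i, p in enumerate(points):
        if labels[i] != -1:
            continue
        centers.append(list(p))
        c = len(centers) - 1
        labels[i] = c
        for j in range(i + 1, len(points)):
            if labels[j] == -1 and similarity(points[j], p) >= threshold:
                labels[j] = c
    # Phase 2: k-Means-like refinement. Recompute each center as the mean of
    # its members, then move every point to its most similar center.
    for _ in range(max_updates):
        new_centers = []
        for c in range(len(centers)):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[c])  # keep the old center of an empty cluster
        centers = new_centers
        new_labels = [
            max(range(len(centers)), key=lambda c: similarity(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

labels, centers = threshold_clustering(
    [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)], threshold=0.9
)
```

Because the number of clusters follows from the threshold, no cluster count has to be chosen in advance, which may be part of what the abstract means by "easier to use".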
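CLUB's first step, finding initial clusters with the mutual k-nearest-neighbor method, can be illustrated with a small brute-force sketch. The abstract does not say how mutual neighbors are grouped, so this sketch assumes that two points are linked when each is among the other's k nearest neighbors (Euclidean distance) and that the connected components of those links are the initial clusters. The function names are hypothetical, and the later steps (density backbone, assignment, outlier detection) are not shown.

```python
import math

def knn_sets(points, k):
    # For each point, the set of indices of its k nearest other points.
    neighbors = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors

def mutual_knn_components(points, k):
    # Assumption: initial clusters are connected components of the mutual k-NN graph.
    nbrs = knn_sets(points, k)
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in nbrs[i]:
                # Follow the link only if the neighbor relation holds both ways.
                if labels[j] == -1 and i in nbrs[j]:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = mutual_knn_components(points, k=2)
```

The brute-force neighbor search here is quadratic in the number of points; the near-linear complexity claimed in the abstract would require an indexed neighbor search, which this sketch leaves out.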
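For SUM, the abstract names the two ingredients of the similarity function (the number of common neighbors of two adjacent vertices and the degree of the lower-degree vertex) but not how they are combined. The sketch below assumes a simple ratio of the two, which may differ from the dissertation's definition, and adds a helper that picks the maximum-degree vertex of a cluster, the vertex whose connecting role SUM questions. All names are hypothetical.

```python
def build_adjacency(edges):
    # Undirected graph stored as a dict mapping each vertex to its neighbor set.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_similarity(adj, u, v):
    # Assumption: similarity = common neighbors / degree of the lower-degree vertex.
    # u and v are adjacent, so both degrees are at least 1.
    common = len(adj[u] & adj[v])
    smaller_degree = min(len(adj[u]), len(adj[v]))
    return common / smaller_degree

def max_degree_vertex(adj, cluster):
    # The vertex of the cluster with the most neighbors in the whole graph.
    return max(cluster, key=lambda x: len(adj[x]))

adj = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
sim_ab = edge_similarity(adj, "a", "b")
sim_cd = edge_similarity(adj, "c", "d")
hub = max_degree_vertex(adj, ["a", "b", "c", "d"])
```

In this toy graph the edge between `a` and `b` scores higher than the edge between `c` and `d`, since `a` and `b` share a neighbor while `c` and `d` share none.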

[Degree-granting institution]: Lanzhou University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP311.13

Article ID: 1711149



Link to this article: https://www.wllwen.com/shoufeilunwen/xxkjbs/1711149.html

