Research and Implementation of Clustering Methods for High-Dimensional Categorical Data

Published: 2018-04-17 02:36

Topic: categorical data + subspace clustering; Source: Donghua University, master's thesis, 2017

[Abstract]: Cluster analysis is an unsupervised machine learning method that partitions unorganized data into clusters according to given rules, so that each cluster consists of highly similar data. This greatly facilitates subsequent data analysis, and clustering is widely used in network services, geography, biology, trade, and other fields. However, as data sources and data collection technologies have developed, the dimensionality and complexity of the data to be analyzed keep growing, and traditional clustering algorithms cannot achieve good results on such data sets. Soft subspace clustering, a research hotspot in high-dimensional data clustering, has attracted increasing attention. For categorical data, however, most existing soft subspace clustering algorithms are extensions of the k-modes algorithm: both the similarity between data objects and the weights of attributes (also called features) depend on the choice of cluster centers (modes), so the quality of the chosen modes directly affects the final clustering quality. In addition, existing soft subspace clustering algorithms do not distinguish missing data from complete data during clustering, which also substantially affects the final result. For high-dimensional, incomplete categorical data, this thesis applies the CLOPE algorithm, which clusters according to the height-to-width ratio of cluster histograms, to soft subspace clustering and proposes a new soft subspace clustering algorithm. First, a missing-data handling method based on rough sets is proposed to deal with missing values in the data set, and attributes are weighted according to their average mutual information. Second, because the clustering quality of CLOPE depends on the order in which the data are input, a "shuffling model" that puts the data in completely random order is proposed to minimize the effect of input order on the final clustering quality. Finally, the algorithm is implemented in Scala on the Spark platform so that it can be used to cluster large-scale data. Real data sets from the UCI repository are used as experimental data, and four groups of experiments verify, respectively, the effectiveness of the shuffling model and the attribute weighting method, the effectiveness of the missing-data handling method, the effectiveness of the proposed soft subspace clustering algorithm, and its scalability with data size. The results show that the proposed algorithm (in the version without missing-data handling) achieves clearly better clustering quality than CLOPE. Compared with most-frequent-value imputation and with no handling at all, the advantage of the proposed missing-data handling method becomes more pronounced as the missing rate increases. Compared with two other representative soft subspace clustering algorithms for categorical data, the proposed algorithm has clear advantages in both clustering quality and running time.
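The abstract names two ingredients that a small example can make concrete: the CLOPE criterion based on the height-to-width ratio of cluster histograms, and shuffling the input before clustering. The Python sketch below is only an illustration under stated assumptions, not the thesis's Scala/Spark implementation. It uses the standard CLOPE profit, sum_i(S_i / W_i^r * N_i) / sum_i(N_i), where S_i is the total number of item occurrences in cluster i, W_i the number of distinct items, N_i the number of records, and r the repulsion parameter. The function names, the default r=2.0, and the seed are hypothetical, and the attribute weighting and rough-set missing-data handling described in the abstract are not modeled.

```python
import random
from collections import Counter


def cluster_stats(records):
    """Return (S, W, N): item occurrences, distinct items, record count."""
    occurrences = Counter(item for record in records for item in record)
    return sum(occurrences.values()), len(occurrences), len(records)


def profit(clusters, r=2.0):
    """CLOPE profit of a clustering; empty clusters are ignored."""
    weighted_sum = 0.0
    total_records = 0
    for records in clusters:
        if not records:
            continue
        s, w, n = cluster_stats(records)
        weighted_sum += s / (w ** r) * n
        total_records += n
    return weighted_sum / total_records if total_records else 0.0


def clope_single_pass(records, r=2.0, seed=0):
    """Shuffle the input, then greedily place each record where profit is highest.

    Assumption: one allocation pass with full profit recomputation, kept
    simple for readability; an efficient version would use incremental deltas.
    """
    order = list(records)
    random.Random(seed).shuffle(order)  # randomize the input order
    clusters = []
    for record in order:
        best_index, best_profit = None, -1.0
        # Try every existing cluster and one new, empty cluster.
        for i in range(len(clusters) + 1):
            trial = [list(c) for c in clusters] + [[]]
            trial[i].append(record)
            candidate = profit(trial, r)
            if candidate > best_profit:
                best_index, best_profit = i, candidate
        if best_index == len(clusters):
            clusters.append([record])
        else:
            clusters[best_index].append(record)
    return clusters
```

A categorical record can be encoded as a tuple of "attribute=value" strings, so that records sharing many values produce a tall, narrow histogram and therefore a high profit.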
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181; TP311.13
