Research and Application of Rough Clustering Algorithms for High-Dimensional Categorical Datasets
Topics: information entropy + weighted overlap distance; Source: master's thesis, Dalian Maritime University, 2017
【Abstract】: Clustering analysis is one of the key techniques of data mining; the data it handles may be numerical, categorical, or mixed. For numerical data, clustering algorithms have achieved excellent results. For categorical data, however, geometric distances in the traditional sense cannot be computed, so many problems remain to be solved, such as designing a reasonable dissimilarity function and finding an effective clustering initialization mechanism. The era of big data has brought massive high-dimensional data with tens, hundreds, or even thousands of attributes; such data are typically incomplete, imprecise, and inconsistent, and traditional clustering algorithms can hardly meet their clustering requirements, yet the ever-richer data also carry more valuable information. How to extract useful information from high-dimensional data has become one of the most active research topics in clustering analysis, and designing a "distance" measure for high-dimensional data is a demanding task. For high-dimensional clustering, the most common approaches are dimensionality reduction and subspace clustering. Dimensionality reduction is a particularly effective way to handle high-dimensional clustering; it mainly comprises feature transformation and feature selection, the latter being a common dimensionality reduction technique in data mining. So far, little research has addressed initialization for categorical data: if the initial cluster centers are chosen poorly, not only are the best clusters not obtained, but the complexity of the algorithm also increases. For high-dimensional categorical data in particular, the choice of initial cluster centers is especially important, and there is still no widely accepted initial-center selection algorithm for categorical data, so proposing one for high-dimensional categorical data clustering is necessary. Extended models of classical rough sets handle incomplete, imprecise, and noisy datasets well, and applying extended rough set methods to high-dimensional incomplete datasets has already yielded some good clustering algorithms. To address these problems, this thesis uses an extended rough set model, the limited tolerance relation, to perform feature selection on high-dimensional incomplete categorical data and to design clustering algorithms. The main work consists of two parts: (1) feature selection for high-dimensional incomplete categorical data: the rough set model is extended with the limited tolerance relation, information entropy and conditional information entropy are redefined, and CEHDAR, a conditional-entropy-based dimensionality reduction algorithm for high-dimensional incomplete categorical data, is constructed; (2) an initial cluster center selection algorithm, WDADI, based on weighted overlap distance and weighted average density: the information entropy under the limited tolerance relation is used to define attribute significance and, from it, the weight of each attribute; when computing distances between objects and the density of each object, attributes are weighted accordingly, reflecting their different contributions to clustering. Experiments show that WDADI outperforms existing clustering initialization methods, and runs on datasets from the UCI repository further demonstrate the effectiveness of the improved algorithms.
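The abstract only describes part (1) at a high level, so the following is a minimal, hypothetical sketch of it: limited tolerance classes over an incomplete categorical table and a greedy, entropy-driven attribute selection. The rough-entropy formula, the gain-based stopping rule, and all identifiers (`limited_tolerance`, `rough_entropy`, `greedy_select`, the toy data) are illustrative assumptions, not the thesis's CEHDAR definitions.

```python
# Illustrative sketch (not the thesis's code): limited tolerance classes and a
# rough-entropy-based greedy attribute selection for incomplete categorical data.
import math

MISSING = None  # marker for a missing categorical value


def limited_tolerance(x, y, attrs):
    """True if x and y are related under the limited tolerance relation on attrs."""
    known_both = [a for a in attrs if x[a] is not MISSING and y[a] is not MISSING]
    if not known_both:
        # related only if both objects are completely unknown on attrs
        return all(x[a] is MISSING for a in attrs) and all(y[a] is MISSING for a in attrs)
    # must agree on every attribute where both values are known
    return all(x[a] == y[a] for a in known_both)


def tolerance_classes(table, attrs):
    """Tolerance class (as an index set) of every object in the table."""
    n = len(table)
    return [{j for j in range(n) if limited_tolerance(table[i], table[j], attrs)}
            for i in range(n)]


def rough_entropy(table, attrs):
    """One common rough-entropy form: H(B) = -(1/|U|) * sum_x log2(|T_B(x)| / |U|)."""
    n = len(table)
    return -sum(math.log2(len(c) / n) for c in tolerance_classes(table, attrs)) / n


def greedy_select(table, all_attrs, max_attrs=None):
    """Forward selection: repeatedly add the attribute that raises rough entropy most
    (i.e. refines the tolerance classes most); stop when no attribute helps."""
    selected, remaining = [], list(all_attrs)
    best = 0.0  # entropy of the empty attribute set (all objects indiscernible)
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        gains = {a: rough_entropy(table, selected + [a]) - best for a in remaining}
        a_star = max(gains, key=gains.get)
        if gains[a_star] <= 0:
            break
        selected.append(a_star)
        remaining.remove(a_star)
        best += gains[a_star]
    return selected


if __name__ == "__main__":
    # toy incomplete categorical table: each row is a dict attribute -> value
    data = [
        {"color": "red",  "shape": "round",  "size": "big"},
        {"color": "red",  "shape": MISSING,  "size": "big"},
        {"color": "blue", "shape": "square", "size": MISSING},
        {"color": "blue", "shape": "square", "size": "small"},
    ]
    print(greedy_select(data, ["color", "shape", "size"]))
```

The greedy loop keeps the attribute subset that most refines the tolerance classes, which is one common way entropy-based reducts are approximated when exhaustive search over subsets is too expensive.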
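Part (2) is likewise only summarized, so the sketch below shows one plausible reading of a weighted-overlap-distance, weighted-average-density initialization: per-attribute weights from a simple normalized entropy (standing in for the limited-tolerance attribute significance used in the thesis), a weighted mismatch distance, and a density-then-distance rule for choosing k initial centers. The names (`attribute_weights`, `weighted_overlap`, `pick_initial_centers`) and the specific formulas are assumptions, not the thesis's WDADI definitions.

```python
# Illustrative sketch (assumed formulas, not the thesis's WDADI code): attribute weights,
# weighted overlap distance, weighted average density, and initial center selection.
import math
from collections import Counter

MISSING = None


def attribute_weights(table, attrs):
    """Normalized Shannon entropy per attribute, standing in for the
    limited-tolerance attribute significance used in the thesis."""
    ent = {}
    for a in attrs:
        vals = [row[a] for row in table if row[a] is not MISSING]
        n = len(vals)
        ent[a] = -sum((c / n) * math.log2(c / n)
                      for c in Counter(vals).values()) if n else 0.0
    total = sum(ent.values()) or 1.0
    return {a: ent[a] / total for a in attrs}


def weighted_overlap(x, y, weights):
    """Weighted overlap (mismatch) distance; a missing value counts as a mismatch."""
    return sum(w for a, w in weights.items()
               if x[a] is MISSING or y[a] is MISSING or x[a] != y[a])


def weighted_density(i, table, weights):
    """Weighted average density: mean similarity of object i to all objects."""
    return sum(1.0 - weighted_overlap(table[i], row, weights) for row in table) / len(table)


def pick_initial_centers(table, attrs, k):
    """First center = densest object; each further center maximizes
    density * distance to the nearest center already chosen."""
    w = attribute_weights(table, attrs)
    dens = [weighted_density(i, table, w) for i in range(len(table))]
    centers = [max(range(len(table)), key=lambda i: dens[i])]
    while len(centers) < k:
        def score(i):
            nearest = min(weighted_overlap(table[i], table[c], w) for c in centers)
            return dens[i] * nearest
        centers.append(max((i for i in range(len(table)) if i not in centers), key=score))
    return centers


if __name__ == "__main__":
    data = [
        {"color": "red",  "shape": "round",  "size": "big"},
        {"color": "red",  "shape": "round",  "size": "big"},
        {"color": "blue", "shape": "square", "size": "small"},
        {"color": "blue", "shape": MISSING,  "size": "small"},
    ]
    print(pick_initial_centers(data, ["color", "shape", "size"], k=2))
```

Combining density with distance to already-chosen centers is a standard way to keep initial centers both representative and well separated; the thesis's actual selection rule may differ.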
【Degree-granting institution】: Dalian Maritime University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP311.13