Research on Clustering Algorithms Based on Data Mass and Potential Entropy

Published: 2017-12-27 01:02

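Keywords: Research on Clustering Algorithms Based on Data Mass and Potential Entropy. Source: Wuhan University, 2016 doctoral dissertation. Document type: degree thesis.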

[Abstract]: With the development of computer science, human society has entered the era of big data. In this era, data analysis has become the key tool for exploiting big data resources, and whoever can find the value in data gains the initiative. Data mining, a core data analysis technology, has broad application prospects: it discovers the knowledge hidden in data, makes full use of data resources, and to some extent relieves the problem of abundant data but scarce knowledge. Data mining has three main modes of analysis: classification, association, and clustering. Classification and association are supervised learning algorithms in machine learning, while clustering is unsupervised. Big data work often emphasizes mining and learning over the full data set, and a suitable training set is hard to obtain. Unsupervised learning therefore fits the big data setting better, and cluster analysis has become a research hotspot in data mining. Addressing the clustering problem in data mining, this thesis proposes the theory of the vector data field, the new concept of data mass in a data field, a data mass clustering algorithm, and a density peak clustering algorithm based on potential entropy. Two applications, facial expression recognition and automatic face clustering, are used to test these theories and methods.
First, the data field is a model for analyzing data; classical data field theory uses potential energy to describe how data are distributed in a data set. Building on it, the thesis proposes the vector data field, so that a data field describes not only the distribution of the data but also their tendency to move, and it unifies the vector and scalar data field models through the Hamilton operator (the del operator ∇).
Second, the concept of the data field comes from physical fields, and objects in a physical field have mass, so data in a data field should have mass as well. The thesis proposes the new concept of data mass: an inherent attribute of a data object within a data set that changes with the mining perspective and, in essence, measures the weight of the object under a specific mining perspective. For the attributes of a data field that do not change with the mining perspective, the thesis introduces the basic matrix of the data field and establishes a system of linear equations relating the basic matrix, the data mass, and the data potential. The basic matrix puts data field computation into matrix form, and on that basis an interior convex point method for the optimal data mass is proposed, which removes the dependence on the choice of initial point that affects the optimal mass in classical data field theory. Building on the potential-mass equations and the idea of the "learning machine", a method for solving the optimal data mass based on a non-homogeneous system of linear equations is proposed, which makes solving for the data mass more efficient.
Then, on the basis of data mass, a data mass clustering algorithm is proposed. Data mass represents how densely the data are concentrated; the algorithm uses it to find the cluster centers and completes the clustering in a single iteration. This addresses problems of traditional partitioning clustering algorithms such as inaccurate cluster centers and the need to supply the number of clusters in advance. For the density peak clustering algorithm published in Science, which requires a manually set threshold, a density peak clustering algorithm based on potential entropy is proposed. Based on the relationship between Shannon entropy and clustering uncertainty, the method builds a function relating Shannon entropy to the threshold and uses it to determine the best threshold for each data set, which makes the clustering algorithm more generally applicable. Finally, the new theory, concepts, and methods are tested on facial expression recognition and automatic face clustering. The results show that data mass reflects well the weight of each pixel in a facial expression and yields good expression eigenfaces and the desired recognition results. In automatic face clustering, the data mass clustering algorithm and the potential-entropy-based density peak clustering algorithm obtain better results than classical clustering algorithms such as density peak clustering and DBSCAN.
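As a rough illustration of how potential entropy can guide a threshold choice, the sketch below computes a scalar data-field potential at every point, normalizes the potentials into a distribution, and picks the candidate kernel width with the lowest Shannon entropy. The Gaussian-like kernel, the equal default weights, the minimum-entropy rule, and all function names are assumptions for illustration; the abstract does not give the thesis's actual entropy-threshold function.

```python
import math


def potentials(points, sigma, masses=None):
    """Scalar data-field potential at each point.

    Assumed kernel: phi(x) = sum_j m_j * exp(-(||x - x_j|| / sigma) ** 2).
    """
    n = len(points)
    if masses is None:
        masses = [1.0 / n] * n  # assumption: every point carries equal weight
    result = []
    for p in points:
        total = 0.0
        for q, m in zip(points, masses):
            total += m * math.exp(-((math.dist(p, q) / sigma) ** 2))
        result.append(total)
    return result


def potential_entropy(points, sigma, masses=None):
    """Shannon entropy of the potentials after normalizing them to sum to 1."""
    phi = potentials(points, sigma, masses)
    z = sum(phi)
    return -sum((v / z) * math.log(v / z) for v in phi if v > 0)


def best_sigma(points, candidates):
    """Return the candidate width whose potential entropy is lowest (assumed rule)."""
    return min(candidates, key=lambda s: potential_entropy(points, s))


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
    for s in (0.001, 1.0, 100.0):
        print(s, potential_entropy(pts, s))
    print("chosen sigma:", best_sigma(pts, [0.001, 1.0, 100.0]))
```

When the width is very small or very large, all potentials become nearly equal and the entropy approaches its maximum, log(n); widths in between expose the uneven density that a clustering algorithm can exploit.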
[Degree-granting institution]: Wuhan University
[Degree level]: Doctorate
[Year degree awarded]: 2016
[Classification number]: TP311.13

[Similar literature]