Research on Fast Dimensionality Reduction and Clustering Algorithms for High-Dimensional Data Streams

Published: 2019-05-18 08:09

[Abstract]: With the explosive growth of data, finding valuable information in data and turning it into organized knowledge has become increasingly difficult, which gave rise to data mining. Cluster analysis, one of the principal methods of data mining, is widely used in many fields. As information technology has developed, data streams have emerged as a new data type and are gradually becoming mainstream, so research on data stream clustering algorithms has become both popular and meaningful. Clustering a high-dimensional data stream involves two parts, dimensionality reduction and clustering. This thesis addresses shortcomings in existing algorithms of each kind, proposes improved algorithms, and demonstrates their advantages experimentally.

Building on prior work, the thesis first addresses two problems of subspace dimensionality reduction algorithms for high-dimensional data streams: they cannot automatically adjust the reduction result as the stream changes, and they require multiple scans of the stream. It proposes a structure-tree-based adaptive subspace dimensionality reduction algorithm for high-dimensional data streams. The algorithm uses an improved relative entropy to find the relevant dimensions of each region, builds the corresponding subspace, and clusters within that subspace, so that different regions correspond to different subspaces. Finding a region's relevant dimensions with relative entropy is simpler and more natural than in Sun Yufen's GSCDS algorithm. A structure tree stores information about the partitioning process and, combined with the idea of backtracking, makes the subspace clustering algorithm adaptive, so the subspace algorithm does not have to be rerun every time new data arrives. A decay factor keeps old data from unduly influencing the clustering result. Experimental results show that the algorithm achieves high clustering quality with relatively low time complexity.

Applying a grid-based clustering algorithm to the dimension-reduced data retains the efficiency and strong adaptability of grid methods, but grid partitioning leads to low precision at cluster edges, which degrades clustering quality. The thesis therefore proposes an improved data stream clustering algorithm that targets two problems of grid-based data stream clustering: low cluster-edge precision and the need to scan the grid multiple times. The improvement has two parts. First, in the initial clustering stage, an inside-out, point-to-surface approach completes clustering in a single scan of the grid, removing the inefficiency of repeated grid scans in the original algorithm. Second, finding the maximal density-connected set separates noise points from useful points in edge regions as far as possible, addressing the loss of edge points in the original algorithm. Experiments show that the improved algorithm is effective at raising cluster-edge precision and adapts well to the distribution of the data.
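
The abstract names three ingredients: relative entropy for picking relevant dimensions, a decay factor that down-weights old data, and grid-based clustering. The sketch below illustrates those general ideas only and is not the thesis's algorithm. The abstract does not define its "improved" relative entropy, so plain Kullback-Leibler divergence against a uniform histogram is swapped in; the grouping step is an ordinary breadth-first search over adjacent dense cells, not the thesis's inside-out scan or its maximal density-connected set. All names (`relevant_dimensions`, `DecayedGrid`, `connected_clusters`) and all parameter values (`bins`, `threshold`, `decay`) are hypothetical.

```python
import math
from collections import deque


def relative_entropy_to_uniform(counts):
    """KL divergence D(p || u) of a histogram p from the uniform distribution u."""
    total = sum(counts)
    bins = len(counts)
    kl = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            kl += p * math.log(p * bins)  # log(p / (1 / bins))
    return kl


def relevant_dimensions(points, bins=4, threshold=0.3):
    """Indices of dimensions whose histogram is far from uniform.

    Assumes every coordinate lies in [0, 1); bins and threshold are
    illustrative values, not taken from the thesis.
    """
    relevant = []
    for d in range(len(points[0])):
        counts = [0] * bins
        for p in points:
            counts[min(int(p[d] * bins), bins - 1)] += 1
        if relative_entropy_to_uniform(counts) > threshold:
            relevant.append(d)
    return relevant


class DecayedGrid:
    """Grid cells whose density fades by a factor `decay` per time step."""

    def __init__(self, cell_width, decay=0.98):
        self.cell_width = cell_width
        self.decay = decay
        self.cells = {}  # cell index tuple -> (density, time of last update)

    def add(self, point, t):
        cell = tuple(int(x // self.cell_width) for x in point)
        density, last = self.cells.get(cell, (0.0, t))
        # Age the stored density to time t, then count the new point.
        self.cells[cell] = (density * self.decay ** (t - last) + 1.0, t)

    def dense_cells(self, t, min_density):
        return {c for c, (d, last) in self.cells.items()
                if d * self.decay ** (t - last) >= min_density}


def connected_clusters(cells):
    """Group face-adjacent cells with BFS; each cell is visited once."""
    unvisited = set(cells)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, queue = {seed}, deque([seed])
        while queue:
            cell = queue.popleft()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in unvisited:
                        unvisited.remove(nb)
                        cluster.add(nb)
                        queue.append(nb)
        clusters.append(cluster)
    return clusters
```

In this sketch a dimension counts as relevant when its points concentrate in few bins, and a cell that stops receiving points eventually falls below `min_density` and leaves its cluster, which is the effect the decay factor is meant to have.
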
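[Degree-granting institution]: Changsha University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13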

[References]