Several Improvements to the Density Peak Clustering Algorithm and Their Application to Earthquake Grading

Published: 2018-03-18 15:18

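Topic: density peak clustering algorithm　Focus: Halo point identification　Source: Jilin University of Finance and Economics, 2017 master's thesis　Document type: degree thesis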

[Abstract]: Advances in science and technology have moved society from an industrial economy to an information economy. Information is now a key production resource, and processing terabyte-scale databases effectively has become one of the most pressing problems in data mining. Clustering, a core data mining tool, is accordingly an active research topic. The density peak clustering algorithm (Density Peak Clustering, DPC) was published in Science in 2014 and has since been widely adopted across many fields. DPC nevertheless has three shortcomings: (1) it handles points in low-density regions of a data set poorly and wrongly assigns outliers and intermediate nodes to clusters; (2) cluster centers are selected manually, which reduces the objectivity and accuracy with which the true clusters are recovered; (3) it performs poorly on data with complex structure, such as complex manifolds, clusters of differing density, and clusters of differing size. This thesis proposes a separate improvement for each of these problems and then applies the improved algorithms to real data.
(1) To address DPC's poor handling of points in low-density regions and its wrong assignment of outliers and intermediate nodes to clusters, the thesis proposes a Halo point identification method based on density peaks (An Improved Recognition Method on Halo Node for Density Peak Clustering Algorithm, HaloDPC). The method borrows the density-reachability idea of the classical DBSCAN algorithm and the structural similarity model of the SCAN algorithm to process the low-density regions of the data set separately and to uncover the information hidden in them.
(2) To address the manual selection of cluster centers, the thesis proposes a fragment clustering method based on density peak clustering (Density Fragment Clustering without Peaks, DFC). The algorithm sorts the local density sequence of the original DPC algorithm in descending order, splits it into fragments using the cutoff distance, and then clusters the fragments by structural similarity, so that cluster centers are obtained automatically.
(3) To address DPC's difficulty with complex data, the thesis proposes a semi-supervised affinity propagation clustering algorithm based on density peaks (Semi-supervised Affinity Propagation based on Density Peaks, SAP-DP). Traditional affinity propagation clusters hyperspherical and compact data quickly. However, its responsibility and availability messages are too tightly coupled, so it treats streamline-shaped, composite, and other complexly structured data too uniformly, has difficulty finding the correct number of clusters, and fails to give reasonable results. Traditional DPC can locate cluster centers for data of arbitrary shape, so the thesis brings this strength into affinity propagation and exploits the sensitivity of density peaks to composite data. To combine the two algorithms effectively, the thesis builds pairwise constraints in a semi-supervised setting and uses the mutual propagation of the two kinds of constraint information to update the clustering similarity matrix, improving both running efficiency and accuracy.
(4) To extend the application of the improved algorithms, the thesis applies them to grading tests on national earthquake data. Simulation experiments show that the improved algorithms grade earthquake magnitude efficiently and accurately and have considerable potential in practice. The experiments also expose the strengths and weaknesses of the improved algorithms on real data, providing a basis for further improving their accuracy and practicality.
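As a rough illustration of the baseline DPC procedure that the thesis sets out to improve, the following Python sketch computes each point's local density, its distance to the nearest denser point, and a cluster label. It is not the thesis's code. The function name `density_peaks`, the parameters `d_c` and `n_clusters`, the cutoff-kernel density, and the rule of taking the points with the largest `rho * delta` as centers are assumptions made here for illustration; the original DPC procedure picks centers by inspecting a decision graph.

```python
import math


def density_peaks(points, d_c, n_clusters):
    """Minimal sketch of baseline DPC (illustrative, not the thesis's code)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: number of other points closer than the cutoff d_c
    # (cutoff kernel; assumed here, a Gaussian kernel is another common choice).
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n   # distance to the nearest point ranked earlier
    parent = [-1] * n   # index of that point
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: by convention, use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j

    # Assumed center rule: the n_clusters points with the largest rho * delta.
    gamma = [rho[i] * delta[i] for i in range(n)]
    centers = sorted(order, key=lambda i: -gamma[i])[:n_clusters]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Every remaining point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return rho, delta, labels


if __name__ == "__main__":
    # Two compact groups plus one isolated point at (5, 5); toy data only.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
           (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
           (5, 5)]
    rho, delta, labels = density_peaks(pts, d_c=1.2, n_clusters=2)
    print(rho)     # [3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 0]
    print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
```

In this toy data the isolated point (5, 5) has a local density of 0 yet still receives the label of its nearest denser neighbour, which corresponds to shortcoming (1) in the abstract. The need to pass `n_clusters` by hand stands in for the manual center selection described in shortcoming (2).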

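[Degree-granting institution]: Jilin University of Finance and Economics
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13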

[References]