基于DNA遗传算法的聚类分析研究与应用

发布时间：2018-09-18 21:33

【摘要】：聚类分析是模式识别中无监督分类的一个非常重要的分支,因为对现实问题研究的需要,近几年来对聚类分析的研究越来越多,相对应的聚类方法也逐渐增多。鉴于现实问题在分类过程的模糊特征,基于目标函数的模糊C均值聚类便逐渐被广泛应用。现在学者们也开始将聚类问题转化成图论问题,在图的基础上进行聚类的谱聚类也是当前聚类研究的一个重要的方向。两种聚类算法都是当下研究的热点,但是无论哪一种都不是通用的,目前各自都存在一些缺陷,为了弥补算法的不足,便可以借助一些智能优化算法对其进行优化,提高聚类算法的性能。本文主要研究的就是模糊C均值聚类算法、谱聚类算法以及对两种算法进行优化的DNA遗传算法。在当前的模糊聚类算法中,模糊C均值聚类(Fuzzy C-means Clustering,简称FCM)因其具有较好的局部搜索能力并且在执行过程中操作简便高效而被广泛应用,但是模糊C均值聚类算法也有一些固有的缺陷和不足,第一:算法本身的隶属度和为1的限定条件易造成算法对数据点中噪声和离群点比较敏感,第二:算法对初始聚类中心的选取非常敏感并容易陷入局部最优。本文为了克服模糊C均值聚类算法的各项不足,整体上采用改进的隶属度计算方式以期降低噪声和离群点对聚类结果的影响,同时加入密度计算来优化FCM算法对初始聚类中心敏感的不足,另外还使用了改进之后的DNA遗传算法协助FCM算法跳出局部最优,最终寻得全局最优解。谱聚类(Spectral clustering,简称SC)算法是建立在图论的谱图理论基础上,其本质就是把聚类问题转换成图划分问题,是典型的点对聚类算法。谱聚类算法能够可以有效的降低计算的复杂度,同时能够保证聚类的质量,但是谱聚类目前是一新兴领域,在很多的地方仍然存在着不足,谱聚类自身可以改进的地方有相似度矩阵的构建、特征值和特征向量的选取以及最终实现聚类的过程。初始的谱聚类在相似度矩阵构建中使用的是基于欧氏距离的高斯核函数,导致在进行相似度矩阵构建时,会受到高斯核参数σ不确定的影响,所以本文采用了基于调整相似度系数的相似度矩阵构建方法创建相似度矩阵,无需人工设定参数,所得结果会更加符合真实情况。另外,谱聚类在进行最终的聚类过程中一般采用的都是K-means聚类方法,只是K-means方法自身存在着对初始聚类中心敏感,且容易陷入局部最优的缺点,所以本文在最终的聚类过程中用改进的DNA遗传算法优化K-means聚类算法,以期更好的完成谱聚类过程。DNA遗传算法与遗传算法有着很大的相似之处,区别在于DNA遗传算法采用了特殊的DNA编码方式对种群个体进行遗传操作进而得到问题的解。DNA遗传算法特有的四进制编码方式,能够更灵活的表示更复杂的信息,编码精度更高。同时DNA遗传算法还具有优良的全局搜索能力及隐性并行性的特点,在本文中,我们正好借助DNA遗传算法的优势来对模糊C均值算法和谱聚类算法进行优化,以解决两类算法本身存在的不足。另外为了提高优化效果,本文还对DNA遗传算法的多个遗传算子进行了相应的改进。本文通过MATLAB进行仿真和实验,首先使用测试函数以及人工数据集证明了改进的DNA遗传算法的可行性及有效性,然后使用UCI数据集分别对提出的改进模糊C均值聚类算法和改进的谱聚类算法进行实验效果验证,证明算法的有效性。同时将改进的模糊C均值算法用于搜狗实验室语料库的文本分类中,实验分类结果有效证明了改进算法的有效性。
[Abstract]:Clustering analysis is a very important branch of unsupervised classification in pattern recognition. In recent years, more and more researches have been done on clustering analysis and the corresponding clustering methods have been gradually increased because of the need for practical problems. Nowadays, scholars have begun to transform clustering problem into graph theory problem. Spectral clustering based on graph is also an important direction of current clustering research. Both clustering algorithms are hot topics, but neither of them is universal. At present, there are some defects in each of them. To make up for the shortcomings of the algorithm, some intelligent optimization algorithms can be used to optimize it and improve the performance of the clustering algorithm. This paper mainly studies the fuzzy C-means clustering algorithm, the spectral clustering algorithm and the DNA genetic algorithm which optimizes the two algorithms. Stering (FCM) is widely used because of its good local search ability and easy and efficient operation in the process of execution. But fuzzy C-means clustering algorithm also has some inherent shortcomings and shortcomings. First, the membership degree of the algorithm itself and the limitation of 1 are easy to make the algorithm sensitive to noise and outliers in the data points. In order to overcome the shortcomings of the fuzzy C-means clustering algorithm, the improved membership calculation method is adopted to reduce the influence of noise and outliers on the clustering results, and the density calculation is added to optimize the FCM algorithm for the initial clustering. In addition, the improved DNA genetic algorithm is used to help FCM algorithm jump out of the local optimum and find the global optimum. Spectral clustering (SC) algorithm is based on the spectral theory of graph theory. Its essence is to transform clustering problem into graph partitioning problem, which is a typical point-to-point clustering problem. Spectral clustering algorithm can effectively reduce the computational complexity and ensure the quality of clustering, but spectral clustering is a new field, there are still some shortcomings in many places, spectral clustering itself can be improved in the construction of similarity matrix, the selection of eigenvalues and eigenvectors and the final reality. The initial spectral clustering uses the Gaussian kernel function based on Euclidean distance in the construction of similarity matrix, which results in the uncertainties of Gaussian kernel parameter_when constructing similarity matrix. In addition, the K-means clustering method is generally used in the final clustering process, but the K-means clustering method itself is sensitive to the initial clustering center and easy to fall into local optimum shortcomings, so in the final clustering process, this paper uses a modification. The improved DNA genetic algorithm optimizes K-means clustering algorithm in order to complete the spectral clustering process better. DNA genetic algorithm and genetic algorithm have great similarities. The difference is that DNA genetic algorithm uses a special DNA coding method to conduct genetic operations on individuals and get the solution of the problem. At the same time, DNA genetic algorithm has excellent global search ability and implicit parallelism. In this paper, we use the advantages of DNA genetic algorithm to optimize the fuzzy C-means algorithm and spectral clustering algorithm to solve the problem of the existence of two kinds of algorithms. In addition, in order to improve the optimization effect, several genetic operators of DNA genetic algorithm are improved correspondingly. In this paper, MATLAB is used to simulate and experiment. Firstly, test function and artificial data set are used to prove the feasibility and validity of the improved DNA genetic algorithm. The improved fuzzy C-means clustering algorithm and the improved spectral clustering algorithm are used to verify the validity of the algorithm. At the same time, the improved fuzzy C-means algorithm is applied to the text classification of the dog laboratory corpus. The experimental results show that the improved algorithm is effective.
【学位授予单位】：山东师范大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP311.13

【参考文献】