Research on the DNA-GA Algorithm Based on Tissue-like P Systems and Its Application in Clustering

Published: 2018-05-08 00:39

Topic: P system + DNA-GA; Source: Shandong Normal University, 2017 master's thesis

[Abstract]: The DNA-GA algorithm is essentially a genetic algorithm built on DNA encoding; it is one way of combining evolutionary computation with DNA computing. Compared with traditional binary encoding, the DNA encoding used by DNA-GA is more flexible and supports a larger set of genetic operations, so DNA-GA can express more genetic information than a conventional genetic algorithm (GA). As a result, DNA-GA can overcome some limitations of GA to a greater extent, such as premature convergence and the Hamming cliff problem of binary encoding, and it has attracted wide attention from researchers in recent years. Designing more effective DNA-GA algorithms is therefore of both theoretical and practical significance.

Membrane computing, also known as P systems, is a class of distributed parallel computing models abstracted from the structure and function of biological cells, tissues, and organs. In terms of computational efficiency, P systems can solve NP-hard problems in linear time, which makes them attractive for computational intelligence. To date, membrane computing has been applied in many fields, including computer science, biology, linguistics, approximate optimization, computer graphics, economics, and cryptography. Compared with theoretical research, applied research on membrane computing is still at an early stage, and researchers expect breakthroughs in the application of P systems.

Cluster analysis is an unsupervised learning technique; that is, it learns from the data on its own, without supervision. The clustering process can be described as follows: each object in the data space is assigned to a cluster according to Euclidean distance, so that objects close to one another fall into the same cluster and objects far apart fall into different clusters. The result is that objects in the same cluster are as similar as possible and objects in different clusters are as different as possible. As research on cluster analysis has developed, it has found wide use in pattern analysis, machine learning, data mining, document retrieval, image segmentation, pattern recognition, and other fields.

Against this background, this thesis takes the tissue-like P system, one of the membrane computing models, as its basis and proposes a DNA-GA algorithm based on tissue-like P systems (TPDNA-GA). The work makes three main contributions. First, some of the genetic operations in the basic DNA-GA algorithm are modified, yielding an improved DNA-GA algorithm based on a new reconstruction crossover operator. Second, the improved DNA-GA algorithm is combined with a tissue-like P system, mainly to exploit the maximal parallelism and membrane rules of the tissue-like P system to improve the performance of DNA-GA; this includes defining the fitness function and improving the membrane rules so as to find the best clustering of the data set to be processed. Three standard test functions are used to validate the performance of the proposed algorithm. Third, TPDNA-GA is combined with K-means and studied through comparative analysis, with algorithm performance evaluated on standard test sets. Finally, the TPDNA-GA clustering process is applied to Web documents: a concrete document clustering procedure is proposed, and experiments on data from Reuters-21578 verify and compare the clustering accuracy, showing that the algorithm can make everyday document lookup more convenient.
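As a rough illustration of how a DNA-encoded individual could be scored for clustering, the sketch below decodes a string over the four bases into cluster centers and measures the total Euclidean distance from each point to its nearest center. This is not the thesis's actual encoding or fitness function, which the abstract does not specify: the digit mapping `BASES`, the gene length `DIGITS_PER_GENE`, and the function names are all assumptions made for the example.

```python
import math

# Assumed mapping of base-4 digits 0..3 to nucleotides; the abstract does not give one.
BASES = "ACGT"
# Hypothetical number of bases used to encode a single coordinate.
DIGITS_PER_GENE = 8


def decode_gene(gene, low, high):
    """Read a DNA substring as a base-4 integer and scale it into [low, high]."""
    value = 0
    for base in gene:
        value = value * 4 + BASES.index(base)
    return low + (high - low) * value / (4 ** len(gene) - 1)


def decode_centers(chromosome, k, dim, low, high):
    """Split a chromosome into k cluster centers with dim coordinates each."""
    n = DIGITS_PER_GENE
    if len(chromosome) != k * dim * n:
        raise ValueError("chromosome length does not match k * dim * DIGITS_PER_GENE")
    coords = [
        decode_gene(chromosome[i:i + n], low, high)
        for i in range(0, k * dim * n, n)
    ]
    return [coords[c * dim:(c + 1) * dim] for c in range(k)]


def assign(points, centers):
    """Label each point with the index of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]


def within_cluster_distance(points, centers):
    """Sum of distances from each point to its nearest center; lower is better."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    # Two one-dimensional centers decoded from a 16-base chromosome.
    centers = decode_centers("AAAAAAAA" + "TTTTTTTT", k=2, dim=1, low=0.0, high=1.0)
    points = [[0.1], [0.2], [0.9]]
    print(centers)
    print(assign(points, centers))
    print(within_cluster_distance(points, centers))
```

A genetic algorithm of this kind would treat `within_cluster_distance` as a quantity to minimize, with crossover and mutation acting directly on the base strings.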
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4;TP311.13

[References]