Machine-Learning-Based Protein Structural Class Prediction and Quality Assessment

Published: 2018-05-14 22:26

Topic: protein structural class + SVM; source: Henan Normal University, 2017 master's thesis

[Abstract]: Proteins are the basic organic molecules that make up cells and the executors of life activities. A protein's role is determined by its function, and its function is determined mainly by its structure, so studying protein structure is of great significance for understanding protein function. However, because the composition of proteins in living organisms is complex and diverse, directly simulating the folding process with molecular dynamics not only requires large amounts of computing resources but also demands a deep understanding of the folding process, which makes fast and accurate structure prediction and model quality assessment difficult. With the development of computer and information technology, protein structural class prediction and quality assessment based on machine learning (ML) has become a research hotspot in bioinformatics. The main work of this thesis covers the following three aspects.

(1) A multi-class model for protein structural classes based on attribute reduction is constructed. For proteins with known amino acid sequences, pseudo-amino-acid features, which are less prone to losing sequence information, are selected first. Because the feature representation of protein sequences contains redundant information, and because structural class prediction is a multi-class problem, the ReliefF algorithm is proposed to reduce the features. A multi-class SVM classifier is then built from several binary SVM models and used to classify the protein structural classes. Compared with the method without feature reduction, the running time is reduced by nearly half, but the model parameters remain difficult to determine.

(2) An SAPSO algorithm is designed to optimize the parameters of the structural class classification model. To address the parameter-selection problem above, the algorithm combines the ability of simulated annealing (SA) to escape local optima with the fast convergence of particle swarm optimization (PSO), yielding a simulated annealing particle swarm optimization (SAPSO) algorithm suited to the protein classification model. Protein classification experiments then demonstrate the effectiveness of the designed method.

(3) Because traditional protein model quality assessment does not take homology information into account, an ML-based protein model quality assessment model is established. The protein sequence is submitted to SWISS-MODEL, which automatically builds its three-dimensional structure. The protein sequence and the Model1 sequence are then submitted to BLAST, and four main features of the sequence alignment are extracted. With homology information taken into account, the extracted feature values are used as input data to train an LS-SVM, while the SAPSO algorithm searches for the LS-SVM parameters. The LS-SVM model built with the optimal parameter values is used to obtain the GDT-TS of the protein. Test experiments show that the designed model has clear advantages in both absolute error and mean squared error, which supports the soundness and effectiveness of the model.
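The abstract does not give the details of the SAPSO algorithm, so the following is only a minimal sketch of one common way to combine the two ideas: a standard PSO velocity update in which each particle's guide position is updated with the simulated-annealing Metropolis rule, so that worse positions are occasionally accepted while the temperature is high. The function name `sapso_minimize`, its default settings, and the objective `toy_cv_error` (a stand-in for a cross-validation error over two model parameters) are assumptions made for illustration and are not taken from the thesis.

```python
import math
import random


def toy_cv_error(params):
    """Hypothetical stand-in for a cross-validation error surface.

    Its minimum (value 0) is at params = (1.0, -2.0).
    """
    x, y = params
    return (x - 1.0) ** 2 + (y + 2.0) ** 2


def sapso_minimize(objective, bounds, n_particles=20, n_iters=100,
                   w=0.7, c1=1.5, c2=1.5, t0=1.0, cooling=0.95, seed=0):
    """Minimize `objective` over box `bounds` with a PSO/SA hybrid (sketch)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Each particle keeps a guide position, updated by the Metropolis rule.
    guide = [p[:] for p in pos]
    guide_cost = [objective(p) for p in pos]

    # The global best is tracked greedily, so it never gets worse.
    best_idx = min(range(n_particles), key=guide_cost.__getitem__)
    best_pos, best_cost = guide[best_idx][:], guide_cost[best_idx]

    temp = t0
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (guide[i][d] - pos[i][d])
                             + c2 * r2 * (best_pos[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))

            cost = objective(pos[i])
            delta = cost - guide_cost[i]
            # Metropolis rule: always accept an improvement; accept a worse
            # position with probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                guide[i], guide_cost[i] = pos[i][:], cost
            if cost < best_cost:
                best_pos, best_cost = pos[i][:], cost
        temp *= cooling  # geometric cooling schedule
    return best_pos, best_cost
```

To tune a classifier in this style, `toy_cv_error` would be replaced by a function that trains the model with the candidate parameters and returns its validation error; for example, `sapso_minimize(toy_cv_error, [(-5.0, 5.0), (-5.0, 5.0)])` searches a two-parameter box.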
[Degree-granting institution]: Henan Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q51;TP181

[References]