Protein Secondary Structure Prediction Based on a Balanced Classification Algorithm
Published: 2018-07-15 09:00
[Abstract]: Proteins play a critical role in life processes and are the material carriers of life activities. Because a protein's structure determines its function, predicting function from structure is very important. Protein structure is described at four levels: the primary structure is the order of amino acid residues in the sequence; the secondary structure is the local spatial conformation of the polypeptide chain (helix, strand, and coil); the tertiary structure is the spatial position of all atoms in the polypeptide chain; and proteins with more than one polypeptide chain also have a quaternary structure, namely the relative arrangement of those chains. It is the tertiary structure that relates directly to function, yet tertiary structure information is hard to obtain directly: traditional physicochemical determination methods are too time-consuming and labor-intensive for the task, and predicting tertiary structure directly from the primary sequence is extremely difficult. Secondary structure prediction, as a bridge between primary and tertiary structure, therefore has broad prospects. However, the beta-sheet content of proteins is generally low, and traditional machine learning classifiers cannot capture interactions between sites that lie far apart in the primary structure. As a result, sheet prediction accuracy is inadequate, which directly degrades overall secondary structure prediction.
This thesis attempts to improve the existing PSIPRED algorithm (a classifier based on artificial neural networks that takes the position-specific scoring matrix of a sequence as its input) by introducing a balanced classification mechanism, so that predictions are more balanced and effective. The improved predictor is then applied to protein structural class prediction, a task at the tertiary structure level. The improvements attempted and the contributions of this thesis are as follows:
1. Four improvement strategies are tried: changing the input encoding of the neural network to include more sequence information related to long-range interactions, such as residue molecular weight, isoelectric point, and hydrophilicity; adopting a balanced sampling strategy that repeatedly samples the low-abundance structure classes during training; using a weighted cost function during training; and weighting the outputs of the neural network to balance the classifier's decisions. Weighting the network outputs proves the most effective. With 8-fold cross-validation on a modified CB513 dataset, the overall accuracy is 74.28% and the corresponding beta-sheet accuracy is 63.73%, which is 2.34 percentage points higher than the original method.
2. The chaos game representation (CGR) of the predicted secondary structure is used as the input feature of a neural network for protein structural class prediction. This yields 71% accuracy on the Astral40 dataset, considerably higher than the CGR method applied directly to primary sequence information. The approach can predict protein structural classes fairly effectively.
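The output-weighting idea named in the abstract can be illustrated with a minimal sketch. The abstract does not state the weights or the exact decision rule, so everything here is an assumption made for illustration: the three-class order (H, E, C), the multiplicative form of the weighting, the weight value 1.3, and the function names `weighted_predict` and `predict_sequence`. The scores are made-up numbers, not PSIPRED output.

```python
# Illustrative sketch only: class order, weights, and function names are
# assumptions; the thesis abstract does not specify them.
CLASSES = ("H", "E", "C")  # helix, strand (beta-sheet), coil


def weighted_predict(scores, weights):
    """Return the class label with the largest weighted network output.

    scores  -- per-class network outputs for one residue, in CLASSES order
    weights -- per-class multipliers; a value above 1 favours that class
    """
    if len(scores) != len(CLASSES) or len(weights) != len(CLASSES):
        raise ValueError("expected one score and one weight per class")
    weighted = [s * w for s, w in zip(scores, weights)]
    best = max(range(len(CLASSES)), key=lambda i: weighted[i])
    return CLASSES[best]


def predict_sequence(score_rows, weights):
    """Apply weighted_predict to every residue and join the labels."""
    return "".join(weighted_predict(row, weights) for row in score_rows)


# Hypothetical network outputs for four residues.
example_scores = [
    (0.70, 0.10, 0.20),
    (0.15, 0.40, 0.45),
    (0.10, 0.42, 0.48),
    (0.20, 0.20, 0.60),
]

unweighted = predict_sequence(example_scores, (1.0, 1.0, 1.0))
reweighted = predict_sequence(example_scores, (1.0, 1.3, 1.0))
print(unweighted, reweighted)
```

In this toy input, residues whose strand score narrowly loses to coil are reassigned to strand once the strand output is up-weighted, while confident helix and coil calls are unchanged. In practice the weights would need to be tuned on held-out data, since raising one class's weight trades accuracy on the other classes for recall on the under-represented one.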
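[Degree-granting institution]: Henan University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q51; TP183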
Article ID: 2123536
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2123536.html