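Analysis and Prediction of Protein Residue Solvent Accessibility and Function Based on Intelligent Computing
Published: 2018-03-09 11:09
Topic: intelligent computing  Approach: machine learning  Source: Northeast Normal University, doctoral dissertation, 2017  Document type: degree thesis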
[Abstract]: Protein structure determines protein function, and the study of protein structure is the foundation of proteomics. Residue solvent accessibility is a basic piece of protein structural information. It is fundamentally important for analyzing the three-dimensional conformation of proteins, building three-dimensional protein structures, predicting interactions between proteins and other molecules, and understanding protein metabolism and evolution. Proteins carry out their functions by interacting with other molecules (nucleic acids, proteins, and small-molecule ligands), so analyzing and identifying functional residues is of practical importance for studying how protein function is expressed. Traditional biophysical and biochemical methods for obtaining protein structure and function information require precise, expensive instruments, tedious experimental procedures, and intensive labor. These traditional methods benefit from the development of bioinformatics, which uses intelligent computing to provide accurate tools for predicting protein structural information and functional residues. In fact, only about 2‰ of proteins have reasonably accurate structural data. Faced with a rapidly growing number of proteins of unknown structure and function, methods based on intelligent computing exploit the efficiency, convenience, and accuracy of computers and provide rich and valuable clues for further experimental investigation. This thesis analyzes and predicts the solvent accessibility and function of protein residues. The main results are as follows: (1) A regression method based on a weighted sliding-window strategy and particle swarm optimization is proposed for predicting the exposure level (solvent accessibility) of protein residues.
First, five types of sequence-based features are extracted to encode each residue and its neighboring residues. To quantify precisely how the solvent accessibility of neighboring residues affects the central residue, a weighted sliding-window strategy assigns a different weight to each position in the window. Finally, particle swarm optimization is used to tune the parameters of the support vector regression algorithm. On two benchmark datasets, the method's predictive performance is considerably better than that of earlier methods. The study examines how different regression algorithms affect the model, compares the effect of different parameter optimization methods on predictive performance, and analyzes the sources of regression error and the average error for each of the 20 amino acids. To verify generalization and to compare against existing predictors, the method and several well-known prediction tools in the field were evaluated on an independent test set; the results show that the method generalizes well. (2) A method based on cost-sensitive ensemble learning and a spatial clustering algorithm is proposed for predicting the antigenic determinant residues involved in antigen-antibody interactions, together with the potential epitopes they form. First, antigen residues are encoded with five sequence-based feature types: conservation, secondary structure, disordered regions, dipeptide composition, and physicochemical properties. To speed up computation and remove redundant features, the Fisher-Markov Selector ranks features by their relevance to the sample labels, and incremental feature selection then yields the optimal feature subset.
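The incremental feature selection step can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption: the function name `incremental_feature_selection`, the `evaluate` callback (standing in for training and validating a model on a feature subset), and the toy scorer are hypothetical. The ranking itself is taken as given, for example from a method such as the Fisher-Markov Selector.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[int],
    evaluate: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Return the top-k prefix of a ranked feature list with the best score.

    ranked_features: feature indices, most relevant first.
    evaluate: hypothetical callback that trains and validates a model on
        the given feature subset and returns a score (higher is better).
    """
    best_subset: List[int] = []
    best_score = float("-inf")
    current: List[int] = []
    for feature in ranked_features:
        current.append(feature)          # grow the subset by one feature
        score = evaluate(list(current))  # e.g. cross-validated performance
        if score > best_score:           # on ties, keep the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score


if __name__ == "__main__":
    # Toy scorer (an assumption for illustration): features 3 and 7 are
    # informative, and every selected feature costs a small penalty.
    informative = {3, 7}

    def toy_score(subset: List[int]) -> float:
        return len(informative & set(subset)) - 0.1 * len(subset)

    subset, score = incremental_feature_selection([3, 7, 1, 5], toy_score)
    print(subset, round(score, 2))  # prints: [3, 7] 1.8
```

In practice the scorer would be a cross-validated metric suited to imbalanced data, and the selected prefix is the shortest one that reaches the best score.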
Epitope prediction is a typical imbalanced classification problem. To overcome the weaknesses of conventional machine learning on such problems, the study introduces a cost-sensitive ensemble learning algorithm. Because the vast majority of antigenic determinant residues are either contiguous in sequence or adjacent in space, a spatial clustering algorithm is applied to the predicted antigenic determinant residues to predict the potential epitopes they may form. The method is compared with earlier methods on a benchmark test set and an independent test set, and the results demonstrate its effectiveness and good generalization. (3) A method based on fast adaptive ensemble learning and a ligand-specific strategy is proposed for predicting heme-binding residues. Based on the properties of heme-binding residues, the method combines amino acid distribution features, motif sequence-template features, surface propensity features, and secondary structure features. Feature analysis shows that heme-binding residues are enriched in cysteine and histidine, tend to lie in concave regions of the protein surface, and are concentrated at the junctions between secondary structure elements. Heme-binding residue prediction is also a typical imbalanced classification problem. The study proposes a new fast adaptive ensemble learning algorithm that optimizes the sub-classifiers by dynamically monitoring and adjusting the ratio of positive to negative samples in the sub-datasets. The algorithm is fast and adapts well. In addition, a ligand-specific strategy is introduced for the two main types of heme ligand, which significantly improves prediction accuracy over a conventional general-purpose model. Experiments on the benchmark test set and the independent test set demonstrate, respectively, the method's superiority over other algorithms and its good generalization.
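The idea of controlling the positive-to-negative ratio in sub-datasets can be sketched as follows. This is not the thesis's fast adaptive ensemble algorithm, whose details the abstract does not give; it is a generic undersampling sketch under stated assumptions. The names `make_subset` and `choose_ratio`, the candidate ratios, and the `validate` callback (standing in for training a sub-classifier on a subset and scoring it on held-out data) are all hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence


def make_subset(labels: Sequence[int], neg_per_pos: float,
                rng: random.Random) -> List[int]:
    """Indices of all positives plus a random sample of negatives.

    neg_per_pos is the target negative-to-positive ratio of the subset.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than exist; keep at least one if any.
    n_neg = min(len(neg), max(1, round(neg_per_pos * len(pos))))
    return pos + rng.sample(neg, n_neg)


def choose_ratio(labels: Sequence[int],
                 candidate_ratios: Sequence[float],
                 validate: Callable[[List[int]], float],
                 seed: int = 0) -> Optional[float]:
    """Pick the candidate ratio whose subset gives the best validation score.

    validate: hypothetical callback that trains a sub-classifier on the
        given sample indices and returns a score (higher is better).
    """
    rng = random.Random(seed)
    best_ratio, best_score = None, float("-inf")
    for ratio in candidate_ratios:
        subset = make_subset(labels, ratio, rng)
        score = validate(subset)  # monitor performance at this ratio
        if score > best_score:
            best_ratio, best_score = ratio, score
    return best_ratio


if __name__ == "__main__":
    labels = [1] * 5 + [0] * 95  # 5 binding residues, 95 non-binding
    subset = make_subset(labels, 2.0, random.Random(1))
    print(len(subset), sum(labels[i] for i in subset))  # prints: 15 5
```

An ensemble would repeat `make_subset` several times at the chosen ratio, train one sub-classifier per subset, and combine their votes.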
The thesis also analyzes and discusses the potential effect of the test set's positive-to-negative sample ratio on the algorithm. Finally, an online prediction tool was released to help biologists analyze heme proteins computationally and efficiently.
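Why the test-set class ratio matters can be shown with a short worked example. For a classifier with fixed sensitivity and specificity, precision still depends on how many negatives there are per positive. The sensitivity and specificity values below are illustrative assumptions, not results from the thesis, and `precision_at_ratio` is a hypothetical helper.

```python
def precision_at_ratio(sensitivity: float, specificity: float,
                       neg_per_pos: float) -> float:
    """Expected precision on a test set with neg_per_pos negatives per positive."""
    true_pos = sensitivity                          # expected TP per positive
    false_pos = (1.0 - specificity) * neg_per_pos   # expected FP per positive
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Illustrative values only: sensitivity 0.8, specificity 0.9.
    for ratio in (1, 5, 20):
        print(ratio, round(precision_at_ratio(0.8, 0.9, ratio), 3))
```

With these illustrative values, precision falls from about 0.89 at a 1:1 ratio to about 0.29 at 20 negatives per positive, even though the classifier itself is unchanged. This is why results on test sets with different class ratios are not directly comparable on precision-based metrics.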
[Degree-granting institution]: Northeast Normal University
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: Q51