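Research on Protein Remote Homology Detection and DNA-Binding Protein Identification
Published: 2018-03-09 07:29
Topic: protein remote homology detection  Focus: DNA-binding proteins  Source: Harbin Institute of Technology, 2017 master's thesis  Document type: degree thesis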
[Abstract]: Proteins are the material basis of life and carry out most of its activities. In the post-genomic era, advances in protein sequencing have caused protein sequence databases to grow explosively, so identifying proteins is of great importance in biology. This thesis studies both protein structure and protein function. On the structure side, we address protein remote homology detection: proteins with the same or similar functions in different species show clear sequence homology, and this homology is used to assign protein sequences of unknown class to a superfamily. On the function side, we address DNA-binding proteins, which play important roles in gene transcription, recombination, repair, and replication. Working from primary protein sequences and machine learning methods, the thesis investigates these two problems as follows.
Protein remote homology detection is a foundation of protein structure research. This thesis proposes the concept of pseudo dimer composition (PDC) to improve on the original pseudo amino acid composition, which captures too little information. First, a frequency profile is used to convert each original sequence into a protein sequence that carries evolutionary information. The PDC feature extraction method then converts this sequence into a fixed-length vector. Support vector machines combined with an ensemble learning strategy predict the protein's superfamily; the ensemble is linear, with each family's ROC value used as its weight.
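The ROC-weighted linear ensemble can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: each base classifier is weighted by the ROC score it obtains on validation data, and the final score is the normalized weighted sum. The function names and the toy scores are hypothetical and are not taken from the thesis.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_linear_ensemble(score_lists, weights):
    """Combine per-classifier scores with a normalized weighted sum."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(score_lists[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_lists)) / total
        for i in range(n)
    ]


# Toy validation data (made up): two base classifiers scoring five sequences.
labels = [1, 1, 0, 0, 1]
clf_a = [0.9, 0.8, 0.3, 0.4, 0.7]  # ranks every positive above every negative
clf_b = [0.6, 0.2, 0.5, 0.1, 0.7]  # makes one ranking mistake

# Assumption: the ROC score of each classifier serves directly as its weight.
weights = [roc_auc(clf_a, labels), roc_auc(clf_b, labels)]
combined = weighted_linear_ensemble([clf_a, clf_b], weights)
```

Normalizing by the sum of the weights keeps the combined score on the same scale as the inputs; it does not change the ranking, which is all that a ROC-based evaluation depends on.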
The remote homology detection method achieves an AUC of 0.927 and an AUC50 of 0.749, and the experiments indicate that it outperforms other methods in the field.
DNA-binding protein identification is an important direction in protein function research. For the first time, this thesis applies frequency profiles containing evolutionary information together with pseudo amino acid composition to this problem. Sequence profiles and pseudo amino acid composition first convert each protein sequence into a fixed-length feature vector, and a support vector machine classifier is then built to identify DNA-binding proteins. The ensemble used here is heterogeneous: the sample set is expanded to obtain more trained models, which are then combined. On the independent test set, the accuracy is 76.56% and the AUC is 0.8392. In addition, analyzing the weights that the support vector machine assigns to different features shows how important the corresponding amino acids are to identification, which in turn allows their biological characteristics to be analyzed.
To address the limited information extracted by pseudo amino acid composition, we propose a method that fuses k-mer amino acid composition with auto-cross covariance. The k-mer composition captures amino acid distance-pair information, while auto-cross covariance captures global physicochemical information about the amino acids. Optimizing the combination of feature parameters further improves the accuracy of DNA-binding protein identification. On the independent test set, the method reaches a prediction accuracy of 75.16%, a further improvement over other methods.
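The fusion of k-mer composition and auto-cross covariance can be sketched as a simple concatenation of two feature vectors. This is an illustrative sketch under stated assumptions, not the thesis's implementation: it uses contiguous k-mers, the standard lag-based covariance definition, the Kyte-Doolittle hydropathy index, and a deliberately simplified toy charge scale. The abstract does not specify which physicochemical properties, k, or lag values were used, and the sketch accepts only the 20 standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}
# Toy charge scale (an illustrative simplification, not from the thesis).
CHARGE = {aa: 0.0 for aa in AMINO_ACIDS}
CHARGE.update(K=1.0, R=1.0, D=-1.0, E=-1.0)


def kmer_composition(seq, k=2):
    """Relative frequency of every contiguous k-mer (20**k features)."""
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence shorter than k")
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(windows):
        counts[seq[i:i + k]] += 1
    return [counts[m] / windows for m in kmers]


def cross_covariance(seq, scale_a, scale_b, lag):
    """Covariance between property a at position i and property b at i + lag.

    With scale_a == scale_b this is the auto covariance of one property.
    """
    n = len(seq)
    if lag >= n:
        raise ValueError("lag must be smaller than the sequence length")
    a = [scale_a[r] for r in seq]
    b = [scale_b[r] for r in seq]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(
        (a[i] - mean_a) * (b[i + lag] - mean_b) for i in range(n - lag)
    ) / (n - lag)


def acc_features(seq, scales, max_lag):
    """All auto and cross covariances for lags 1..max_lag."""
    return [
        cross_covariance(seq, sa, sb, lag)
        for sa in scales for sb in scales
        for lag in range(1, max_lag + 1)
    ]


def fused_features(seq, k=2, max_lag=2):
    """Concatenate k-mer composition with auto-cross covariance features."""
    return kmer_composition(seq, k) + acc_features(seq, [HYDROPATHY, CHARGE], max_lag)
```

With k = 2, two property scales and a maximum lag of 2, every sequence maps to 400 + 2 × 2 × 2 = 408 features regardless of its length, which is what allows a support vector machine to consume proteins of different sizes.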
For DNA-binding protein identification, this thesis also proposes a selective ensemble method based on affinity propagation clustering. To improve prediction accuracy and to study ensemble methods in more depth, we adopt a feature extraction strategy based on distance pairs over reduced amino acid alphabets. Using the affinity propagation clustering strategy, 656 base classifiers are clustered and combined into an ensemble. The method achieves an accuracy of 83.87% on the independent test set, a further improvement over other methods.
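The distance-pair features over a reduced alphabet can be sketched as follows. The six-group reduction by side-chain chemistry, the group letters, and the choice of distances 1 to 3 are all assumptions made for illustration; the abstract does not state which reduced alphabets or distances the thesis used.

```python
# Illustrative six-group reduction (an assumption, not the thesis's alphabet).
GROUPS = {
    "a": "AVLIMC",  # aliphatic / sulfur-containing
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # glycine and proline
}
TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
LETTERS = sorted(GROUPS)


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(TO_GROUP[r] for r in seq)


def distance_pair_features(seq, max_distance=3):
    """Frequencies of reduced-letter pairs separated by 1..max_distance."""
    reduced = reduce_sequence(seq)
    if len(reduced) <= max_distance:
        raise ValueError("sequence too short for the requested distance")
    features = []
    for d in range(1, max_distance + 1):
        pairs = len(reduced) - d
        counts = {(x, y): 0 for x in LETTERS for y in LETTERS}
        for i in range(pairs):
            counts[(reduced[i], reduced[i + d])] += 1
        # One block of len(LETTERS)**2 frequencies per distance.
        features.extend(counts[(x, y)] / pairs for x in LETTERS for y in LETTERS)
    return features
```

Reducing 20 residues to 6 groups shrinks each distance block from 400 to 36 features, so several distances can be covered without the vector becoming too sparse for short sequences.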
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q811.4;TP311.13
Article ID: 1587548
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1587548.html