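3D Graphical Representation of Proteins Based on Chaos Game Representation and Its Applications
Published: 2018-01-21 20:20
Keywords: chaos game representation; protein similarity; support vector machine; anticancer peptides. Source: Shandong University, 2017 master's thesis. Document type: degree thesis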
[Abstract]: With the launch of the Human Proteome Project (HPP) and the arrival of the post-genomic era, the life sciences have produced massive amounts of protein sequence data. Processing and analyzing these sequences with molecular-biology experiments consumes considerable time and material, and the results can be unstable. Guided by the core idea that "sequence determines structure, and structure determines function," a growing number of researchers use mathematical algorithms and computational techniques to process large numbers of protein sequences and extract meaningful structural and functional information, which in turn guides and supports experimental work. Bioinformatics models for sequence data are widely applied in many research areas, including drug development, disease diagnosis, and other fields closely tied to human health. Because proteins are complex in composition and diverse in function, protein sequences are far harder to analyze than DNA or RNA sequences. Existing protein-sequence analysis tools often suffer from limitations such as weak biological meaning, poor visualizability, high time complexity, and low accuracy. Starting from the biological background and drawing on informatics and statistical theory, this thesis therefore proposes a three-dimensional (3D) graphical representation of proteins that has low time complexity and clear biological meaning. The representation is then applied to two important areas of bioinformatics, protein sequence similarity analysis and functional protein prediction, to verify its feasibility. The main research work is as follows:

1. Based on the characteristics of the chaos game representation (CGR), a reverse CGR graphical representation for codons is proposed and combined with important physicochemical properties of amino acids, so that protein sequences are mapped one-to-one into 3D space. The reverse CGR model clusters synonymous codons together, which is consistent with the wobble hypothesis in biology. Building on an efficient momentum vector extraction method, a momentum vector extraction algorithm for 3D curves is then proposed. It removes the effect of differing sequence lengths on downstream applications, greatly reduces time complexity, and improves the ability to handle larger data sets.

2. The new 3D graphical representation is applied to three classic protein evolutionary analysis data sets and compared with ClustalW and several recent alignment-free algorithms. The reverse CGR representation achieves similar or better results, consistent with the actual biological evolutionary relationships.

3. To test the graphical representation in other sequence analysis tasks, the vectors extracted from it are fused with statistical information, such as amino acid composition and dipeptide composition computed after grouping amino acids by physicochemical properties, and a predictor is built with a support vector machine (SVM). Learning and prediction are carried out on three kinds of data sets (anticancer peptides, bacterial adhesins, and eukaryotic neurotoxin proteins), with five-fold cross-validation as the evaluation method. On the anticancer peptide main and alternative data sets, the accuracy reaches 96% and 97.73%, far exceeding the other methods in the cited literature. On the two balanced data sets, the accuracy reaches 88.82% and 86.11%, similar to the best results of Tyagi's methods. However, Tyagi's best-performing method differs between the two data sets, so the proposed method performs well on both, whereas Tyagi's methods are less stable. On the bacterial adhesin and eukaryotic neurotoxin data sets, the prediction accuracy is 92.75% and 98.00%, respectively, far exceeding the other methods in the cited literature.

The experiments show that the proposed 3D graphical representation not only has strong biological meaning and low time complexity, but also performs well in sequence similarity analysis and in binary classification of functional proteins, which supports the feasibility and generality of the method.
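For readers unfamiliar with the chaos game representation mentioned in the abstract, the following is a minimal Python sketch of the classic forward CGR for a nucleotide sequence. It is not the reverse, codon-based 3D construction proposed in the thesis, whose details the abstract does not give. The corner assignment in `CORNERS`, the starting point, and the function name `cgr_points` are illustrative assumptions.

```python
# Classic (forward) chaos game representation of a nucleotide sequence.
# NOTE: this is the textbook CGR, not the thesis's reverse CGR for codons.
# The corner layout below is one commonly used convention (an assumption here).
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return one 2D point per base.

    Each point is the midpoint between the previous point and the
    corner assigned to the current base. Characters other than
    A, C, G, T raise a KeyError.
    """
    x, y = start
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    # The start codon ATG traced through the unit square.
    for base, point in zip("ATG", cgr_points("ATG")):
        print(base, point)
```

Because each step halves the distance to a corner, sequences that end in the same bases land in the same sub-square. That locality is what makes CGR-style plots useful for comparing sequences visually.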
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: Q51
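[Related literature]
Related journal articles (top 1)
1 Shi Long; Huang Hailan; Similarity analysis based on the chaos game representation of DNA sequences [J]; Journal of Jishou University (Natural Science Edition); 2009, No. 03
Related master's theses (top 3)
1 Xu Chunrui; 3D Graphical Representation of Proteins Based on Chaos Game Representation and Its Applications [D]; Shandong University; 2017
2 Liu Fei; A study of phylogenetic relationship analysis based on the chaos game representation of genomes [D]; Xiangtan University; 2013
3 Li Bo; Markov model simulation of the chaos game representation of complete mitochondrial genomes [D]; Xiangtan University; 2009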
Article ID: 1452473
Article link: https://www.wllwen.com/shoufeilunwen/benkebiyelunwen/1452473.html