Mass Spectrometry Data Analysis Methods Based on Bayesian Theory

Published: 2018-01-21 20:00

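  Keywords: mass spectrometry, proteomics, Bayesian theory, machine learning. Source: East China Normal University (华东师范大学), master's thesis, 2012. Document type: degree thesis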


[Abstract]: Genomics, which grew out of the Human Genome Project, has played an epoch-making role in the exploration of the principles of life. As it developed, however, researchers came to realize that exploring the nature of life at the gene level alone is far from sufficient, and that life phenomena need to be studied at a more fundamental level. Proteomics emerged in response. Mass spectrometry, as an effective tool, has greatly helped scientists study proteins.
This thesis first introduces the current mainstream workflows and techniques for mass spectrometry-based protein analysis and describes several commonly used mass spectrometry-based protein identification algorithms, including those in SEQUEST, MASCOT, and X! Tandem. It summarizes the two strategies for protein quantification, isotope labeling and label-free quantification, analyzes their differences and respective advantages, and reviews the common algorithms for mass spectrometry-based discovery and identification of protein post-translational modifications.
Existing mass spectrometry-based protein identification algorithms each have their own strengths. We attempt to integrate them using machine learning combined with naive Bayes theory. The machine learning methods chosen include SVM, LDA, logistic regression, KNN, Bayesian belief networks, and artificial neural networks. The classification features include several of the parameters provided by the SEQUEST algorithm. The training data come from 18 sets of mass spectrometry data from known protein mixtures. Machine learning yields the decision boundary of each classifier, and the conditional distributions of the positive and negative samples under the classifier's decision function are then computed. From these conditional distributions and the score of a new sample under the classifier, the posterior probability of a protein identification result can be computed with the naive Bayes method under a uniform prior. Cross-validation shows that our algorithm reaches an accuracy of 80%-90% while keeping recall at 40%-50%, which gives it good practical value.
Identifying protein post-translational modifications has long been an important area of proteomics research. The usual mass spectrometry-based approaches are machine learning and direct comparison against known databases. Database comparison has high time complexity, and because so many comparisons are made, its false positive rate is also high. We attempt to first cluster the mass spectrometry data with a clustering algorithm based on projection distance and then identify post-translational modifications on that basis, which both reduces the time complexity of the algorithm and improves its accuracy. The projection direction is computed from known samples using LDA and SVM so that, along this direction, within-class distances are as small as possible and between-class distances are as large as possible. Once the projection direction is obtained, the projection distance between every pair of unknown samples is computed to form a distance matrix, and the data are clustered directly from this matrix with common clustering algorithms. Each resulting cluster may consist of instances of the same peptide carrying different post-translational modifications, so comparing the results within a cluster allows possible post-translational modifications to be found quickly and efficiently. Under cross-validation on known data, both the accuracy and the recall of the algorithm are about 70%.
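The projection-distance clustering described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the thesis's actual method: a two-class Fisher discriminant stands in for the LDA step, the projection distance between two samples is taken to be the absolute difference of their projections onto the learned direction, and a simple threshold-linkage grouping stands in for the "common clustering algorithms". The function names `fisher_direction`, `projection_distance_matrix`, and `threshold_clusters` are hypothetical.

```python
import numpy as np

def fisher_direction(X_a, X_b):
    """Two-class Fisher discriminant direction, w proportional to Sw^-1 (mu_a - mu_b).

    Assumption: the known samples are split into two labeled groups.
    """
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    # Pooled within-class scatter; a small ridge keeps the solve well conditioned.
    Sw = (np.cov(X_a, rowvar=False) * (len(X_a) - 1)
          + np.cov(X_b, rowvar=False) * (len(X_b) - 1))
    Sw = Sw + 1e-6 * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, mu_a - mu_b)
    return w / np.linalg.norm(w)

def projection_distance_matrix(X, w):
    """Pairwise distances |w.x_i - w.x_j| between samples projected onto w (assumed definition)."""
    z = X @ w
    return np.abs(z[:, None] - z[None, :])

def threshold_clusters(D, eps):
    """Group samples into connected components of the graph with edges d_ij <= eps."""
    n = len(D)
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            # Unlabeled neighbours of j within eps join the current cluster.
            for k in np.flatnonzero((D[j] <= eps) & (labels < 0)):
                labels[k] = current
                stack.append(k)
        current += 1
    return labels
```

In this sketch the direction is learned once from labeled data and reused for all unlabeled samples, so building the distance matrix costs one dot product per sample plus a pairwise subtraction. The threshold `eps` is a free parameter that would have to be tuned on known data.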
Since Google introduced the concept of cloud computing, cloud-based applications have emerged one after another. Protein mass spectrometry data analysis is high-throughput and parallelizable, so it can be conveniently deployed on cloud computing platforms. We propose two deployment strategies and compare their advantages and shortcomings.
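The score-to-posterior step described in the abstract (class-conditional score distributions combined by naive Bayes under a uniform prior) can be sketched as follows. The abstract does not say how the conditional distributions were estimated, so this sketch assumes each classifier's scores are Gaussian within each class, and it assumes the scores of different classifiers are conditionally independent, which is the naive Bayes assumption. The names `fit_gaussians`, `log_gauss`, and `posterior_positive` are hypothetical.

```python
import numpy as np

def fit_gaussians(scores):
    """Per-classifier mean and standard deviation of an (n_samples, n_classifiers) score array."""
    mu = scores.mean(axis=0)
    sigma = scores.std(axis=0, ddof=1) + 1e-9  # guard against zero variance
    return mu, sigma

def log_gauss(s, mu, sigma):
    """Elementwise log density of a normal distribution (assumed model for the scores)."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (s - mu) ** 2 / (2.0 * sigma**2)

def posterior_positive(s, pos_params, neg_params, prior_pos=0.5):
    """P(correct identification | scores) via naive Bayes.

    prior_pos=0.5 corresponds to the uniform prior mentioned in the abstract.
    """
    s = np.atleast_2d(s)
    # Naive Bayes: log-likelihoods of the individual classifier scores add up.
    log_pos = log_gauss(s, *pos_params).sum(axis=1) + np.log(prior_pos)
    log_neg = log_gauss(s, *neg_params).sum(axis=1) + np.log1p(-prior_pos)
    # Posterior as a logistic function of the log-odds; clipping avoids overflow.
    log_odds = np.clip(log_pos - log_neg, -700.0, 700.0)
    return 1.0 / (1.0 + np.exp(-log_odds))
```

With a uniform prior the prior terms cancel, so the posterior depends only on the likelihood ratio of the two score distributions. A score lying halfway between two equal-variance class distributions gets a posterior of 0.5.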
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: R346

Article ID: 1452431



Article link: https://www.wllwen.com/xiyixuelunwen/1452431.html

