Establishment of a Proteomic Mass Spectrometry Data Analysis Platform and Its Application to Large-Scale Data Analysis

Published: 2018-05-25 13:48

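Topics: proteomics + mass spectrometry; Source: Academy of Military Medical Sciences of the Chinese People's Liberation Army, 2017 doctoral dissertation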

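[Summary]: Proteomics is one of the focal areas of life science research in the post-genomic era. It studies the patterns of protein expression in the cells, organs, and tissues of organisms and elucidates their biological significance. Biological mass spectrometry is one of the key technologies of proteomics; its development has driven large-scale proteome research and made high-throughput, high-sensitivity, high-resolution proteomic analysis platforms possible. Shotgun identification is the most important strategy in proteome research: experiments produce tandem mass spectrometry data, a protein sequence database is searched to obtain reliable peptide identifications, and protein inference then yields the identified proteins. Because of the nature of mass spectrometry data, the diversity of biological samples, the complexity of experimental procedures, and the limitations of existing search algorithms and quality control methods, database searching improves the efficiency of interpreting mass spectrometry data but still cannot fully solve the protein identification problem. Ensuring the correctness and completeness of identifications is the main challenge for the database search strategy. As mass spectrometers keep improving and massive volumes of high-accuracy data are produced, analysis methods for large-scale proteomic mass spectrometry data clearly lag behind. The bottleneck is no longer the production of experimental data but its effective analysis. It is therefore necessary to build a mass spectrometry data analysis platform that automates large-scale analysis. In addition, the peptide information contained in high-accuracy tandem mass spectrometry (MS/MS) data offers new approaches to genome interpretation: searching genome databases with high-accuracy MS/MS data can further raise the fraction of spectra that are interpreted. The idea behind proteogenomics is to integrate tandem mass spectrometry data to annotate the protein-coding genes of a genome.

This work is devoted to improving the database-search-based mass spectrometry data analysis workflow, building a platform, and applying it to large-scale data such as the human liver proteome. We first compared the stringency of quality control methods at the spectrum, peptide, and protein levels and developed ProDistiller, a quality control and protein assembly program for the Mascot search engine. We then examined the differences among commonly used protein sequence databases and their effect on identification results and, drawing on our laboratory's long experience in data analysis, integrated mass spectrometry analysis software into a data analysis platform, the Mass Spectrum Data Processing Pipeline (MSPP). Using these quality control methods and this platform, we systematically analyzed the massive data sets produced by the Chromosome-centric Human Proteome Project and those collected for the human liver proteome. Finally, we established a workflow for mining novel proteins based on genome databases and predicted-proteome databases, achieving in-depth interpretation of massive human proteome mass spectrometry data. The specific contents are as follows.

Protein-level quality control is more stringent than spectrum-level or peptide-level quality control. This matters most for complex sample data sets, which integrate many experiments and therefore accumulate many false-positive identifications at the protein level. We developed ProDistiller, which performs protein-level quality control and protein assembly on PepDistiller results: it sets the spectrum score F-value, ranks the spectrum results of a given sample, assembles proteins one at a time, and stops when the protein-level FDR reaches 1%, which yields the score cutoff. Protein assembly follows a simple-principle approach. ProDistiller is written in Perl, runs on multiple platforms, and retains peptide identification attributes in its output, such as charge, number of missed cleavage sites, and precursor and fragment ion mass errors.

Commonly used proteome sequence databases include NCBI nr, UniProt, RefSeq, and Ensembl. They are largely similar in theoretical peptide composition and differ mainly in the alternatively spliced protein isoforms they contain. Searches against the well-annotated UniProt and Swiss-Prot databases yield more identifications than searches against the other databases. UniProt and Swiss-Prot are also far smaller than Ensembl, RefSeq, and NCBI nr, so they demand less hardware and computing time. We therefore recommend UniProt and Swiss-Prot, with their high data quality and low redundancy, as the best choice for routine database searches in proteomic mass spectrometry identification; gene-centric studies can use Swiss-Prot as the search database.

MSPP integrates multiple functional modules, including searching with several search engines, multi-level quality control and integration, and labeled and label-free quantification, and it takes multi-node scheduling and task allocation into account, so it can meet the demands of processing massive data. The platform has been applied successfully to data from the China Human Proteome Project, the Chromosome-centric Human Proteome Project, and the human liver proteome data set, and has processed more than 400 million spectra to date. As proteomic mass spectrometry advances rapidly and data volumes grow, large-scale, high-throughput automated analysis on high-performance computing platforms will require further optimization of task scheduling, data distribution, and result collection, leading toward a high-throughput, automated platform for identifying novel proteins from tandem mass spectrometry data.

MSPP was applied successfully to complex samples in the Chromosome-centric Human Proteome Project. We performed deep transcriptome, translatome, and proteome analysis of three human hepatocellular carcinoma cell lines with different metastatic potential, Hep3B, HCC97H, and HCCLM3. Proteomics identified 9,064 genes, 50.2% of the genes in the translatome. A transcription factor enrichment strategy identified 31 low-abundance proteins, demonstrating that enrichment is effective for identifying low-abundance proteins. Searching sample-specific databases, we found that peptides carrying single amino acid polymorphisms (SAPs) account for only 0.4% of all identified peptides, which indicates that SAPs have little effect on protein identification.

To obtain the most complete human liver proteome data set, we systematically collected liver-related mass spectrometry data as completely as possible and recorded sample status, producing the first version of the most complete liver mass spectrometry data collection. The experimental data fall into three sample types: adult liver, fetal liver, and hepatocellular carcinoma cell lines. Reanalyzing these data with MSPP, we built an up-to-date, high-confidence human liver proteome data set with 9,901 identified genes, far more than the existing human liver data set in PeptideAtlas (4,408 proteins). Comparison with the liver tissue-specific expression profiles in Swiss-Prot and ProteinAtlas showed that many proteins are still missed. Examining the scores of the corresponding spectra, we found that many of them were filtered out even though their scores were good rather than low, so the identification results contain a large number of false negatives.

We established a data analysis workflow based on genome databases and achieved a preliminary in-depth interpretation of massive human proteome mass spectrometry data. Searching high-accuracy mass spectrometry data against a genome database (a theoretical exon junction database) and the AceView predicted-protein database, we found several candidates supported by high-confidence spectra, including 5 peptides that may come from novel alternative splicing (AS) events and 3 peptides from novel proteins. Although these results still need experimental validation, the experiment demonstrates the feasibility of annotating the genome with mass spectrometry data and establishes the analysis method.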
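The protein-level cutoff that the abstract attributes to ProDistiller (rank spectrum results by score, add proteins one at a time, stop once the protein-level FDR exceeds 1%) can be sketched roughly as follows. This is not the ProDistiller code, which is written in Perl and not reproduced in the abstract. The sketch assumes a target-decoy search in which FDR is estimated as decoy proteins divided by target proteins; the abstract does not say how FDR is estimated, and the names `PSM` and `protein_level_cutoff` are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class PSM(NamedTuple):
    score: float    # stand-in for the F-value spectrum score (assumed: higher is better)
    protein: str    # accession of the protein the matched peptide maps to
    is_decoy: bool  # True if the protein comes from a decoy database (assumption)


def protein_level_cutoff(
    psms: Iterable[PSM], max_fdr: float = 0.01
) -> tuple[Optional[float], set[str]]:
    """Walk PSMs from best to worst score, adding proteins one at a time.

    Stops before the first PSM that would push the running protein-level
    FDR (decoy proteins / target proteins) above max_fdr. Returns the score
    of the last accepted PSM and the set of accepted target proteins.
    """
    targets: set[str] = set()
    decoys: set[str] = set()
    cutoff: Optional[float] = None
    for psm in sorted(psms, key=lambda p: p.score, reverse=True):
        seen = decoys if psm.is_decoy else targets
        is_new = psm.protein not in seen
        n_targets = len(targets) + (1 if is_new and not psm.is_decoy else 0)
        n_decoys = len(decoys) + (1 if is_new and psm.is_decoy else 0)
        if n_decoys / max(n_targets, 1) > max_fdr:
            break  # accepting this PSM would exceed the FDR limit
        seen.add(psm.protein)
        cutoff = psm.score
    return cutoff, targets
```

Because the FDR is computed over distinct proteins rather than spectra, many spectra pointing at the same decoy protein count only once, which is the sense in which the filter operates at the protein level.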
[Abstract]: Proteomics is one of the most active areas of life science research in the post-genomic era. It studies protein expression in the cells, organs, and tissues of organisms and clarifies its biological significance. Biological mass spectrometry is one of the key techniques of proteomics; its development has promoted large-scale proteome research and enabled proteomic analysis platforms with high throughput, high sensitivity, and high resolution. Shotgun identification is the most important strategy in proteome research: tandem mass spectrometry data are produced experimentally, reliable peptide identifications are obtained by searching a protein sequence database, and identified proteins are then obtained through protein inference. Because of the characteristics of mass spectrometry data, the diversity of biological samples, the complexity of the experimental process, and the limitations of existing search algorithms and quality control methods, the database search strategy improves the efficiency of interpreting mass spectrometry data but cannot completely solve the protein identification problem. Ensuring the correctness and completeness of identifications is the main problem of the database search strategy. With the continuous development of mass spectrometers, massive amounts of high-accuracy data are being produced, and analysis methods for large-scale proteomic mass spectrometry data are clearly lagging behind. The bottleneck of mass spectrometry data analysis is no longer the output of experimental data but its effective analysis. It is therefore necessary to establish a mass spectrometry data analysis platform that automates large-scale analysis.

On the other hand, the peptide information contained in high-accuracy tandem mass spectrometry (MS/MS) data can bring new ideas to genome interpretation. Starting from high-accuracy MS/MS data and searching genome databases can further improve the proportion of spectra that are interpreted. The idea of proteogenomics is to integrate tandem mass spectrometry data to annotate the protein-coding genes of the genome. This work is devoted to improving the mass spectrometry data analysis workflow based on the database search strategy, constructing a platform, and applying it to large-scale data analysis such as the human liver proteome. First, we compared the stringency of quality control methods at the spectrum, peptide, and protein levels and developed ProDistiller, a quality control and protein assembly program for the Mascot search engine. We then explored the differences among commonly used protein sequence databases and their influence on identification results and, based on our laboratory's long-term data analysis experience, integrated mass spectrometry data analysis software to build the data analysis platform Mass Spectrum Data Processing Pipeline (MSPP). Using these quality control methods and this platform, we carried out a systematic analysis of the massive data sets produced by the Chromosome-centric Human Proteome Project and collected for the human liver proteome. Finally, we established a data analysis workflow for mining novel proteins based on genome databases and predicted-proteome databases, achieving in-depth interpretation of massive human proteome mass spectrometry data. The specific contents are as follows. Protein-level quality control is more stringent than spectrum-level and peptide-level quality control. Especially for complex sample data sets, which integrate many experiments, many false-positive identifications accumulate at the protein level.

We developed the ProDistiller program, which performs protein-level quality control and protein assembly based on PepDistiller results. It sets the spectrum score F-value, ranks the spectrum results of the same sample, assembles proteins one by one, and stops the assembly when the protein-level FDR reaches 1%, which gives the cutoff. Protein assembly is based on a simple-principle method. ProDistiller is written in Perl and runs on a variety of platforms. Its results retain the attributes of each peptide identification, such as charge, number of missed cleavage sites, and precursor and fragment ion mass errors. Commonly used proteome sequence databases include NCBI nr, UniProt, RefSeq, and Ensembl. These databases are basically similar in theoretical peptide composition and differ in the alternatively spliced protein isoforms they contain. The well-annotated UniProt and Swiss-Prot databases yield more identifications than the other databases. UniProt and Swiss-Prot are also much smaller than Ensembl, RefSeq, and NCBI nr, so they require less hardware and computing time. We therefore recommend UniProt and Swiss-Prot, which have high data quality and low redundancy, as the best choice for routine database searches in proteomic mass spectrometry identification; gene-centric studies can use Swiss-Prot as the search database. MSPP integrates and implements multiple functional modules, including searching with several search engines, multi-level quality control and integration, and labeled and label-free quantification, and it takes multi-node scheduling and task allocation into account, so it can meet the needs of massive data processing. The platform has been successfully applied to data analysis for the China Human Proteome Project, the Chromosome-centric Human Proteome Project, and the human liver proteome data set.

More than 400 million spectra have been processed so far. With the rapid development of proteomic mass spectrometry, the scale of data keeps increasing; for large-scale, high-throughput automated analysis, high-performance computing platforms need to further optimize task scheduling, data distribution, and result collection in order to establish a high-throughput, automated platform for identifying novel proteins from tandem mass spectrometry data. MSPP has been successfully applied to the analysis of complex samples in the Chromosome-centric Human Proteome Project. Three human hepatocellular carcinoma cell lines with different metastatic potential, Hep3B, HCC97H, and HCCLM3, were subjected to deep transcriptome, translatome, and proteome analysis. Proteomics identified 9,064 genes, 50.2% of the total number of genes in the translatome. Thirty-one low-abundance proteins were identified through a transcription factor enrichment strategy, which shows that enrichment is effective for identifying low-abundance proteins. Through sample-specific database searches, we found that single amino acid polymorphism (SAP) peptides account for only 0.4% of all identified peptides, which indicates that single amino acid polymorphisms have little effect on protein identification. To obtain the most complete human liver proteome data set, we systematically collected liver-related mass spectrometry data as completely as possible, recorded the state of the samples, and obtained the first version of the most complete liver mass spectrometry data collection. The experimental data are divided by sample type into adult liver, fetal liver, and hepatocellular carcinoma cell lines. Using MSPP to reanalyze the liver mass spectrometry data, we constructed the latest high-confidence human liver proteome data set, in which 9,901 genes were identified.

This is far more than the existing human liver data set in PeptideAtlas (4,408 proteins). Comparison with the liver tissue-specific expression profiles in Swiss-Prot and ProteinAtlas showed that a large number of proteins are still missed. Analysis of the scores of the corresponding spectra showed that many of them were filtered out even though their scores were good rather than low, so the identification results contain a large number of false negatives. We established a data analysis workflow based on genome databases and achieved a preliminary in-depth interpretation of massive human proteome mass spectrometry data. Using high-accuracy mass spectrometry data to search a genome database (a theoretical exon junction database) and the AceView predicted-protein database, we found several candidates supported by high-confidence spectra, including 5 peptides that may come from novel alternative splicing (AS) events and 3 peptides from novel proteins. Although the results still need further experimental verification, this experiment demonstrates the feasibility of annotating the genome with mass spectrometry data and establishes the analysis method.
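The claim that the sequence databases are largely similar in theoretical peptide composition can be checked with an in-silico digestion along the following lines. This is a minimal sketch, not the procedure used in the dissertation: it assumes trypsin specificity (cleavage after K or R, but not before P), which the abstract does not state, and the function names and toy sequences are hypothetical.

```python
def cleavage_pieces(sequence: str) -> list[str]:
    """Split a protein sequence after K or R, except when followed by P."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start : i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal piece without K/R
    return pieces


def theoretical_peptides(
    proteins: dict[str, str], missed_cleavages: int = 0, min_len: int = 7
) -> set[str]:
    """Collect the distinct peptides of every protein in a database."""
    peptides: set[str] = set()
    for sequence in proteins.values():
        pieces = cleavage_pieces(sequence)
        for i in range(len(pieces)):
            last = min(i + missed_cleavages, len(pieces) - 1)
            for j in range(i, last + 1):
                peptide = "".join(pieces[i : j + 1])
                if len(peptide) >= min_len:
                    peptides.add(peptide)
    return peptides


def peptide_overlap(db_a: dict[str, str], db_b: dict[str, str], **kwargs) -> float:
    """Jaccard index of the theoretical peptide sets of two databases."""
    a = theoretical_peptides(db_a, **kwargs)
    b = theoretical_peptides(db_b, **kwargs)
    return len(a & b) / len(a | b) if a | b else 1.0
```

In such a comparison, a database that adds a splice isoform contributes only the few peptides spanning the variant region, so the overlap stays high even when the entry counts differ considerably.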
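[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: Q51;Q811.4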

[Related literature]