Exploring Bioinformatics Methods for Analyzing Viral Genomes in High-Throughput Sequencing Data

Published: 2018-07-30 07:47

[Abstract]: Viruses are infectious agents that can replicate only inside living host cells. They are tiny and structurally simple; with the exception of prions (which consist of protein only), every virus is composed of one type of nucleic acid (DNA or RNA), serving as its genetic material, together with protein. Viruses are highly diverse and have a broad host range: any cellular organism can serve as a host. As the carrier of viral genetic information, the viral genome is the core data for studying a virus. With the spread of high-throughput sequencing, sequencing viral genomes has become the main approach to studying viral genetics and evolution. The large volume of data produced requires bioinformatics analysis that extracts as much useful information about the viral genomes as possible. The aim of this thesis is to explore bioinformatics methods for analyzing viral genomes in high-throughput sequencing data of different types. Starting from the sequencing data and analysis needs accumulated by our research group, and focusing on pathogenic microorganisms, the work analyzes the viral genome information contained in their sequencing data. It falls into two parts: (1) analysis of temperate phages in bacterial high-throughput sequencing data; (2) virus discovery and genome analysis in complex sequenced samples.

Analysis of temperate phages in bacterial high-throughput sequencing data. Temperate phages are viruses that can integrate into the genome of their host bacterium and be passed on as the host replicates. Under certain inducing conditions they can also excise from the host genome, produce progeny phage, and be released. This replication strategy lets temperate phages mediate horizontal gene transfer, which often has a major effect on host pathogenicity; for example, the main virulence gene of the enterohemorrhagic Escherichia coli O104:H4 identified in Germany is encoded by a prophage. Using genome sequencing data from 72 bacterial strains isolated from patients with foot ulcers as the study material, and the replication mechanism of temperate phages as the theoretical model, this work searched for new temperate phage genomes and their integration characteristics, to provide a basis for understanding phage biology and for preventing and controlling infections by highly pathogenic bacteria.

Data were processed and analyzed with a combination of bioinformatics software and custom programs. NGS QC Toolkit v2.3.3 was used for quality control of the raw sequencing data, removing short reads and low-quality data. Given the characteristics of Ion Torrent data, the commercial software Newbler v3.0 was chosen for assembly. A prophage prediction pipeline was built with Perl scripts and applied to the assembled contigs. To obtain the genomes of active prophages, two tools were used to assist assembly: the ContigScape plugin, which displays the connections between assembled contigs, and the commercial software CLC Genomics Workbench 9, used for sequence adjustment and for checking the assembly results. Contigs were joined with in-house laboratory software, and the resulting temperate phage genomes were annotated with the RAST online annotation tool. Finally, the genome structures, integration sites, and evolutionary relationships of the temperate phages were analyzed together to mine further information.

In the data from 11 of the 72 bacterial strains, prophages were found to have excised from the bacterial genome and replicated. Assembling the sequences of these excised, replicating phages yielded 14 complete genomes of activated prophages, 11 of which share very low homology with known phage sequences and are newly discovered in this work. These new sequences show that the method can be used to discover new temperate phages and to broaden researchers' knowledge of phages. The analysis showed that, in the integrated state, the phage integrase gene always lies immediately adjacent to the integration site. Integration-site sequences vary in length and features but correlate with the integrase. A single integration site can be used by multiple temperate phages with similar integrases, which suggests a new approach to prophage prediction. Temperate phages whose hosts belong to the same bacterial genus have similar genome structures.

Virus discovery and genome analysis in complex sequenced samples. Because virus isolation and culture take a long time and have a low success rate, complex samples often have to be sequenced directly and the useful viral information extracted from them, which poses challenges for data analysis. In recent years the research group has used high-throughput sequencing to detect pathogens in clinical samples, which requires analyses that identify the pathogens quickly and accurately. No single existing bioinformatics tool meets our needs for analyzing complex sequenced samples, so we developed the software 《高通量测序数据病原体归类分析软件v1.0》 (High-Throughput Sequencing Data Pathogen Classification and Analysis Software v1.0). The software detects four classes of pathogens (bacteria, fungi, protozoa, and viruses) and has performed well in discovering both known and unknown viruses in complex samples.

The discovery of a known virus in a complex sample is illustrated by the imported case of Rift Valley fever identified in Beijing in July 2016. Analysis of the sequencing data with the software revealed a large number of Rift Valley fever virus sequences, confirmed Rift Valley fever virus as the causative agent, and promptly yielded the complete genome sequence of the strain. The strain is most homologous to the Kakamas strain found in South Africa in 2009, and evolutionary analysis suggests that it has not undergone recombination.

The discovery of an unknown virus in a complex sample is illustrated by Menghai rhabdovirus. The virus was isolated from Aedes albopictus mosquitoes captured in the Menghai area of Yunnan; after culture in C6/36 cells, primers for common viruses failed to identify it. Analysis of its high-throughput sequencing data, with interfering sequences from host cells, bacteria, and other viruses excluded, yielded the complete genome sequence. Sequence analysis showed it to be a novel rhabdovirus, named Menghai rhabdovirus, most similar to two other mosquito-borne rhabdoviruses discovered in Peru. As part of the genome analysis, the terminal sequences of 93 selected rhabdovirus reference sequences were also examined. Of these, 45 have short inverted-repeat terminal sequences and are distributed across different genera. Members of the genus Lyssavirus share a highly consistent terminal sequence, "ACGCTTAAC", while the four genera Ephemerovirus, Vesiculovirus, Tibrovirus, and Sprivivirus all share the terminal sequence "ACGAAGA". The terminal sequences of viral genomes are often involved in genome replication and tend to be relatively strictly conserved, which suggests that short inverted-repeat termini are likely a characteristic feature of genomes in the family Rhabdoviridae.

In summary, building on existing methods for viral genome analysis, this thesis proposes a method for analyzing the complete genomes of activated prophages and their integration sites from bacterial sequencing data; the method can be used to discover new temperate phages and provides new knowledge about them. A pathogen classification and analysis software for high-throughput sequencing data was developed, its software copyright was registered, and it has performed well in detecting unknown pathogens. A new rhabdovirus was discovered through data analysis, and the terminal sequence features of Rhabdoviridae genomes were analyzed. Viral genome analysis still requires methods designed for the specific research object and analytical needs; we hope the methods and conclusions presented here offer reference and ideas to other researchers.
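The abstract does not say how the terminal sequence analysis was implemented. As a rough illustration only, a short inverted-repeat terminus can be checked by testing whether the 5' end of a genome is the reverse complement of its 3' end. The function names and the toy sequence below are hypothetical and are not taken from the thesis; the toy sequence merely reuses the "ACGAAGA" motif quoted above.

```python
# Minimal sketch (not the thesis's method): find the longest 5' prefix of a
# genome that is the exact reverse complement of its 3' end.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat(genome: str, max_len: int = 30) -> str:
    """Return the longest perfect terminal inverted repeat (up to max_len nt).

    RNA input is accepted: U is treated as T. Ambiguous bases such as N
    end the repeat rather than counting as a match.
    """
    seq = genome.upper().replace("U", "T")
    limit = min(max_len, len(seq) // 2)  # the two termini must not overlap
    if limit == 0:
        return ""
    rc_tail = reverse_complement(seq[-limit:])
    k = 0
    while k < limit and seq[k] in "ACGT" and seq[k] == rc_tail[k]:
        k += 1
    return seq[:k]


if __name__ == "__main__":
    # Hypothetical toy genome: 5' "ACGAAGA", a filler, then the reverse
    # complement of "ACGAAGA" at the 3' end.
    toy = "ACGAAGA" + "CATTGAC" + reverse_complement("ACGAAGA")
    print(terminal_inverted_repeat(toy))
```

Running such a check over a set of reference genomes and tallying the returned prefixes per genus would give a count comparable in spirit to the 45-of-93 figure, although the thesis may have allowed mismatches or used alignment instead of exact matching.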
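[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army (中国人民解放军军事医学科学院)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: R373;Q811.4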
