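Experimental Design and Data Analysis for Pooled Sequencing
Published: 2018-01-17 22:28
Keywords: experimental design and data analysis for pooled sequencing. Source: Southeast University, doctoral dissertation, 2017. Document type: degree thesis.
Related topics: pooled sequencing, group testing, combinatorial sequencing, rare mutations, rare haplotypes, individual haplotype construction, single nucleotide polymorphism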
[Abstract]: DNA sequencing technology can be traced back to the 1950s, to methods that determined polyribonucleotide sequences by chemical degradation. After decades of effort, DNA sequencing has developed at an unprecedented pace and its cost has fallen dramatically. With the commercialization of second- and third-generation high-throughput sequencing, the cost of sequencing a human genome has dropped to one thousand US dollars. Sequencing technology is now moving rapidly toward higher throughput, lower cost, and longer reads. Although sequencing costs have fallen significantly, whole-genome sequencing of large numbers of individuals remains very expensive; the main challenge is the large cost of amplifying many DNA samples and constructing their libraries. Pooled sequencing emerged to make full use of the very high throughput of current instruments: several samples are mixed and sequenced in a single run. A major problem of pooled sequencing is that the reads of all samples are mixed together, so barcodes are needed to determine which sample each read came from. Because read length is limited in high-throughput sequencing, barcode sequences must be very short, so the number of samples that can be encoded is very limited, and ligating a specific sequence to each of a large number of samples is time-consuming and laborious. In 2009, Patterson et al. proposed a new pooled sequencing design, combinatorial sequencing, in which a large number of samples are pooled combinatorially and then sequenced. Each sample is mixed into several pools, and the sample's pooling pattern serves as a code that labels it. After sequencing, a decoding method uses the pooling patterns to recover the sequencing data belonging to each sample. Compared with ordinary pooled sequencing, combinatorial sequencing therefore involves an encoding step and a decoding step. Encoding is the combinatorial pooling itself, that is, designing a pooling scheme that guarantees each sample a unique pooling pattern. Decoding is the process of recovering each sample's sequencing data from the pooled results according to its pooling pattern. This thesis studies the experimental design and data analysis of pooled sequencing, and of combinatorial sequencing in particular. It first constructs optimized combinatorial sequencing designs, then applies them to screening for carriers of rare mutations, screening for carriers of rare haplotypes, and constructing individual haplotypes, and finally develops a method for detecting single nucleotide polymorphisms in pooled samples based on two-nucleotide real-time sequencing-by-synthesis, validated with data from real pooled sequencing experiments. The thesis consists of the following parts.

1. Design and optimization of combinatorial sequencing schemes for screening carriers of rare mutations. We first build a model of the optimal sequencing depth for pooled sequencing and a cost model for combinatorial sequencing. We then use pooling matrix designs from the field of group testing and choose the design parameters that minimize sequencing cost while preserving the accuracy of carrier identification. Given the limit on the number of samples per pool and the very high depth that pooled sequencing requires, dividing a large sample set into several groups and running combinatorial sequencing independently on each group further reduces the cost of screening. Simulations show that, with the sequenced region limited to 30 Mb, using optimized combinatorial sequencing to screen 200 diploid samples for carriers of a 1% rare mutation reduces the cost to 52% of that of sequencing each individual separately. To exploit the quantitative information in pooled sequencing results, namely the number of reads carrying the mutation, we draw on quantitative designs from group testing and propose a combinatorial sequencing scheme that screens large sample sets for rare-mutation carriers more efficiently. The scheme pools samples according to a random k-set matrix, and we define an indicator probability to evaluate the performance of a pooling matrix. A heuristic Bayesian decoding algorithm then identifies the mutation carriers. Using publicly available real sequencing reads and artificially simulated pooled sequencing results, we simulated combinatorial sequencing to screen 200 Escherichia coli strains for strains carrying rare mutations. The scheme correctly identified 91.5%-97.9% of the rare-mutation carriers for rare-mutation frequencies ranging from 0.5% to 1.5%. Compared with combinatorial sequencing based on ordinary group testing and with published compressed sequencing methods, the scheme based on quantitative group testing performs better, especially in reducing the amount of sequencing data required and the experimental cost.

2. An algorithm for estimating haplotype frequencies in pooled samples and identifying carriers of rare haplotypes. Using a prior database of known haplotypes, we propose Ehapp, which estimates from pooled sequencing results the proportion of each haplotype in the database. Ehapp converts haplotype frequency estimation in a pooled sample into the problem of finding a sparse solution to a linear system and solves it with sparse signal reconstruction algorithms from compressed sensing. When a pooled sample containing 10 haplotypes is sequenced at 50× depth, the relative error of the haplotype proportions estimated by Ehapp is about 3%. Even when the pooled sample contains unknown haplotypes, Ehapp still estimates accurately the proportions of known haplotypes whose content exceeds 0.05. In simulations using both simulated sequencing results and publicly available real sequencing results, Ehapp outperforms existing algorithms under many sequencing experimental designs. By using Ehapp to estimate haplotype proportions in pooled samples, we also show that screening for carriers of rare haplotypes with combinatorial sequencing is feasible. Building on Ehapp, we propose an upgraded version, Ehapp2. Unlike Ehapp, Ehapp2 takes local haplotypes within a fixed length, rather than single SNPs, as its basic unit. Ehapp2 also uses the expectation-maximization algorithm to estimate the proportions of local haplotypes, which makes effective use of sequencing quality scores to reduce the impact of sequencing errors. Extensive simulations show that Ehapp2 is insensitive to sequencing errors: even at a sequencing error rate of 0.05, when a pooled sample containing 10 haplotypes is sequenced at 50× depth, the error of the estimated haplotype proportions remains about 3%. Because Ehapp2 uses local haplotypes rather than single SNPs as its basic computational unit, it can also estimate accurately the proportions of recombinant haplotypes. Comparisons with Ehapp and with the existing algorithm Harp show that Ehapp2 performs better and is better suited to current second-generation high-throughput sequencing. Ehapp and Ehapp2 for Linux can be downloaded from http://bioinfo.seu.edu.cn/Ehapp and http://bioinfo.seu.edu.cn/Ehapp2, respectively.

3. A scheme for constructing individual haplotypes based on combinatorially pooled clone sequencing. After a clone library is constructed for an individual, a large number of clones are pooled combinatorially according to a random matrix design and sequenced. From the pooling pattern of each clone, we recover all clones carrying each allele and hence all alleles carried by each clone, which reconstructs the clone sequences. Finally, the individual haplotype assembly software HapCUT links the clones to reconstruct the individual's haplotype sequences. Based on the diploid genome of individual NA12878, we simulated the assembly of the haplotypes of chromosome 1. The final assembly contains 112 contigs with an N50 of 3.4 Mb and no flip errors. Compared with existing methods, our method is more accurate. To make the method easier to use, we implemented it as a pipeline, available at http://bioinfo.seu.edu.cn/OPShap.

4. A method for detecting single nucleotide polymorphisms in pooled samples based on two-nucleotide real-time sequencing. For two-nucleotide real-time sequencing, a sequencing technology patented by Southeast University, we propose a method (Epds) for detecting single nucleotide polymorphisms in pooled DNA samples. Based on the five types of difference that can exist between the signal profiles of the wild-type and mutant sequences, we use an enumeration algorithm to infer the mutation position and estimate the proportion of the corresponding mutant sequence. With three two-nucleotide addition schemes, Epds can further identify the mutant base. Extensive simulations show that, with the coefficient of variation of the sequencing signal fixed at 0.0016, Epds detects single nucleotide polymorphisms present at a proportion above 0.02 in pooled samples with an accuracy above 89%, and with a false discovery rate of only 3%. Epds performs better than existing methods for detecting single nucleotide polymorphisms in pooled samples that are based on single-nucleotide addition sequencing. Finally, validation with real pooled sequencing experiments shows that Epds can be applied effectively to detecting single nucleotide polymorphisms in pooled samples. The Epds code is publicly available at http://bioinfo.seu.edu.cn/Epds.
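The encoding and decoding steps of combinatorial sequencing described in the abstract can be illustrated with a small, noise-free sketch. It is not the thesis's method: the thesis decodes quantitative read counts with a heuristic Bayesian algorithm, whereas this sketch swaps in a much simpler elimination rule from classical group testing (a sample is kept as a candidate carrier only if every pool it was mixed into is positive). The function names, the 40 pools, the choice k = 4, and the carrier indices are all assumptions made for illustration.

```python
import random


def random_k_set_design(n_samples, n_pools, k, seed=0):
    """Encoding: assign each sample to k distinct pools chosen at random.

    Returns a list in which entry s is the sorted list of pool indices
    that sample s is mixed into (its pooling pattern).
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_pools), k)) for _ in range(n_samples)]


def pool_outcomes(design, carriers, n_pools):
    """Idealized readout: a pool is positive if it contains any carrier.

    Assumption: no sequencing noise, and only presence/absence of the
    mutation is used, not the number of mutant reads.
    """
    positive = [False] * n_pools
    for sample in carriers:
        for pool in design[sample]:
            positive[pool] = True
    return positive


def decode_by_elimination(design, positive):
    """Decoding: keep samples whose pools are all positive.

    Any sample that appears in a negative pool cannot be a carrier.
    The result always contains the true carriers but may also contain
    a few false candidates.
    """
    return [s for s, pools in enumerate(design)
            if all(positive[p] for p in pools)]


# Hypothetical experiment: 200 samples, 40 pools, each sample in 4 pools.
design = random_k_set_design(n_samples=200, n_pools=40, k=4, seed=1)
carriers = [17, 123]
positive = pool_outcomes(design, carriers, n_pools=40)
candidates = decode_by_elimination(design, positive)
print("positive pools:", sum(positive))
print("candidate carriers:", candidates)
```

In this idealized setting 40 pooled libraries stand in for 200 individual ones, at the price of possible false candidates that would need confirmation. Using the mutant read counts, as the quantitative design in the thesis does, is what allows such ambiguity to be reduced further.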
[Degree-granting institution]: Southeast University
[Degree level]: Doctorate
[Year conferred]: 2017
[Classification number]: Q78
Article ID: 1438280
Article link: https://www.wllwen.com/shoufeilunwen/jckxbs/1438280.html