Large-Scale Detection of Pathogen Characteristic Sequences
Published: 2018-04-19 19:10
Topics: pathogen + detection; Source: Chongqing University, master's thesis, 2015
[Abstract]: As living standards rise, human health and safety draw increasing attention. In recent years, the emergence of new epidemic diseases has put human health and life under unprecedented threat. Most of these diseases are caused by common pathogens such as bacteria and viruses that, through mutation, recombination, and evolution, remain hidden in a host and cause illness after a long period of residence. The pathogens behind such diseases are generally hard to detect and isolate at an early stage, and an outbreak, once under way, is hard to control, which poses a great threat to human life. If a detection technique applied early in an epidemic can narrow the range of suspected pathogens, it can give direction to subsequent pathogen isolation and identification, serological testing, clinical diagnosis, and targeted treatment. The high-throughput detection techniques in wide use today are not yet mature, and China still lacks an independent capability for rapid high-throughput pathogen detection. The key to building that capability is efficient identification of pathogen nucleic acid sequences, that is, quickly locating the target species in an environmental sample. This thesis proposes a fast method for computing pathogen characteristic sequences. A characteristic sequence is a set of nucleotide sequences that can be used to identify the presence of a species against background sequences and to distinguish it from other species; in other words, it uniquely represents the target species. The thesis first compiles a list of the species and genus relationships of common pathogenic microorganisms, then uses NCBI sequence identifiers and these relationships to download the raw sequence data from well-known databases. The downloaded sequences can be described and annotated according to experimental needs to build a database of complete pathogen sequences. The whole-genome alignment algorithm MUMmer is then used for alignment, and the cluster's task scheduling system parallelizes the alignment to speed it up. After alignment, a whole-genome sequence match information database is built. From this intermediate match information, the characteristic sequences can be computed in linear time. Taking the intersection of the match information among the target species' sequences yields the sequences shared within the target species; taking the union of the match information against the background sequences identifies the sequences that are specific relative to the background. Sequences that are shared within the target and specific with respect to the background are the pathogen's characteristic sequences. Because the alignment experiments in this thesis involve relatively few background species, the resulting characteristic sequences do not uniquely represent the species in an environmental setting and must be screened further. Screening uses BLAST: the characteristic sequences are searched for similarity against the NT nucleotide sequence database, the BLAST output file is parsed, and a qcovhsp threshold is applied to filter them. After screening, a pathogen characteristic sequence database can be built according to the species and genus relationships, for convenient lookup by bioinformatics researchers and medical personnel.
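The set operations and the screening step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: match information is modeled as half-open `(start, end)` intervals on one reference genome of the target species, "specific relative to the background" is read as "outside the union of background matches", and a candidate is rejected when a non-target hit reaches the qcovhsp threshold. All function names, the four-column tabular BLAST layout (`qseqid sseqid staxids qcovhsp`), and the threshold value are hypothetical. The sweeps are linear in the number of intervals; the initial sort is not, unless the matches already arrive ordered by reference position.

```python
from typing import Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference genome (assumed model)


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Union of intervals: sort, then fuse overlapping or touching ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection of two merged lists using a linear two-pointer sweep."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Parts of merged list a that are not covered by merged list b."""
    out: List[Interval] = []
    j = 0
    for start, end in a:
        cur = start
        while j < len(b) and b[j][1] <= cur:
            j += 1  # skip background intervals that end before this region
        k = j
        while k < len(b) and b[k][0] < end:
            if b[k][0] > cur:
                out.append((cur, b[k][0]))
            cur = max(cur, b[k][1])
            k += 1
        if cur < end:
            out.append((cur, end))
    return out


def characteristic_regions(target_matches: List[List[Interval]],
                           background_matches: List[List[Interval]],
                           min_len: int = 1) -> List[Interval]:
    """Regions shared by every target genome and matched by no background genome.

    target_matches must contain at least one interval list.
    """
    shared = merge(target_matches[0])
    for matches in target_matches[1:]:
        shared = intersect(shared, merge(matches))
    background = merge(iv for matches in background_matches for iv in matches)
    return [(s, e) for s, e in subtract(shared, background) if e - s >= min_len]


def off_target_queries(blast_lines: Iterable[str], target_taxids: Set[str],
                       threshold: float = 80.0) -> Set[str]:
    """Query ids with a non-target hit whose qcovhsp reaches the threshold.

    Assumes tab-separated columns: qseqid, sseqid, staxids, qcovhsp.
    """
    rejected: Set[str] = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        qseqid, _sseqid, staxids, qcovhsp = line.rstrip("\n").split("\t")[:4]
        hit_taxids = set(staxids.split(";"))
        if float(qcovhsp) >= threshold and not (hit_taxids & target_taxids):
            rejected.add(qseqid)
    return rejected
```

In use, `characteristic_regions` would receive one interval list per aligned target genome and one per background genome, and the candidates it returns would then be dropped if their ids appear in the set produced by `off_target_queries`. Whether a hit that covers both target and non-target taxa should count against a candidate is a design choice; this sketch keeps such candidates.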
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: R440
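[Similar literature]
Related conference papers (top 1)
1 Li Bingjun; Liu Sifeng; Grey relational analysis based on multi-behavior characteristic sequences and multi-level correlation factors [A]; Proceedings of the 2006 Conference on Grey System Theory and Its Applications [C]; 2006
Related master's theses (top 3)
1 Wang Songjian; Large-scale detection of pathogen characteristic sequences [D]; Chongqing University; 2015
2 Gao Lei; A study of Rh blood group gene features based on characteristic sequences and the CGR method [D]; Jiangnan University; 2009
3 Li Hua; Modification of the polynomial kernel function in SVM and a study of Exon-Intron characteristic sequences [D]; Beijing University of Technology; 2001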
Article ID: 1774349
Article link: https://www.wllwen.com/huliyixuelunwen/1774349.html