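Research on Several Statistical Computing Models and Their Applications in Biomedical Information Processing
Published: 2018-06-24 11:52
Topics: fetal electrocardiogram + ensemble empirical mode decomposition; Source: Shandong University, doctoral dissertation, 2016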
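[Abstract]: This work arises from practical problems in medicine and biology. Using time series analysis, statistical signal processing, statistical machine learning and pattern recognition, and meta-analysis, it builds four efficient statistical computing models and applies them to four problems: extraction and denoising of intrauterine fetal electrocardiogram signals, identification of protein-coding regions in eukaryotes, virus prediction from large volumes of next-generation sequencing (NGS) short reads, and a meta-analysis of the association between alcohol dependence and NPY gene polymorphisms.

A high-quality fetal electrocardiogram (FECG) is of great value in helping physicians monitor the fetus in utero and reach clinical diagnoses. In practice, however, a clean FECG is hard to obtain, because the FECG is usually mixed with the maternal ECG (MECG) and other noise such as baseline wander, power-line interference, and other high-frequency noise. In Chapter 1 we propose a new adaptive integrated algorithm for separating maternal and fetal ECG signals and denoising the FECG. It combines the strengths of independent component analysis (ICA), ensemble empirical mode decomposition (EEMD), and wavelet shrinkage (WS). First, ICA separates the FECG from the mixed abdominal ECG (AECG), which yields a noisy FECG. Second, an integrated algorithm based on EEMD and wavelet shrinkage denoises the FECG obtained in the first step. This algorithm has three stages: EEMD decomposition; a test of the statistical information content of the useful sub-signals, followed by wavelet shrinkage of those sub-signals; and partial signal reconstruction to remove baseline wander. Finally, we test the method on simulated and real signals, assessing its accuracy by the signal-to-noise ratio (SNR), mean square error (MSE), and correlation coefficient (R) of the simulated signals before and after denoising. The results show that the proposed ICA-EEMD-WS algorithm outperforms traditional signal separation and denoising methods.

The protein-coding regions (exons) of eukaryotic DNA sequences govern protein synthesis during translation and are therefore of great importance to life processes. In Chapter 2 we treat protein-coding region identification (gene structure prediction) in bioinformatics as a pattern recognition, or classification, problem. Many classification techniques have been proposed for predicting coding regions (exons) and non-coding regions (introns) in eukaryotic DNA. Among them, the discrete Fourier transform (DFT), a digital signal processing (DSP) technique, has been highly successful because it does not depend on prior knowledge. DFT-based methods, however, suffer from inherent low spectral resolution and spectral leakage, so they quickly lose their advantage on short DNA sequences. We propose a new integrated algorithm based on autoregressive (AR) spectral analysis and the wavelet packet transform (WPT) to improve the efficiency and accuracy of coding region identification. The algorithm first converts the DNA sequence into a numerical sequence using a DNA numerical mapping (the Code13 mapping method). It then treats this numerical sequence as the observed signal of an autoregressive model and uses the efficient Marple algorithm, which solves the Yule-Walker equations, to estimate the model's power spectral density (PSD). Next, the value of the PSD at frequency θ = 2π/3 (the three-base periodicity, or TBP, property) gives a signal-to-noise ratio (SNR) curve. After this SNR curve is denoised with the wavelet packet transform, a suitable threshold identifies the exon regions. Tests on three well-known benchmark data sets (GENSCAN65, HMR195, and BG570) show that the new algorithm identifies protein-coding regions more accurately than traditional DFT-based methods.

Viruses, pathogenic viruses in particular, have threatened human health for thousands of years, and new viruses and variants have kept emerging in recent years. It is therefore important to use computational biology to help medical experts quickly narrow down suspected viral sequences in massive NGS short-read databases. Doing so provides high-quality candidates for subsequent experimental confirmation, greatly reduces experimental cost, improves the capacity and timeliness of the response to new viruses, speeds up large-scale vaccine development and production, saves lives, and reduces the number of people infected. In Chapter 3 we combine alignment-based and alignment-free methods into an integrated classification algorithm that distinguishes viral from human sequences and then predicts the virus class. The algorithm first uses BLAST to align each query sequence against a large virus database and a large human database. If a highly homologous target sequence is found, its class is taken as the class of the query and the algorithm stops. For sequences that cannot be aligned, our alignment-free method plays a complementary role: the DNA sequence is converted into a numerical vector, which is fed to a support vector machine (SVM) classifier to predict its class. If the prediction is "virus", a multi-class random forest (RF) then predicts which of six virus classes the sequence belongs to. We tested the integrated algorithm on eight independent test sets and compared it with other prediction methods. The algorithm performs well at virus-versus-human classification, and its results on shorter sequences in particular are broadly satisfactory. In multi-class prediction at the virus level the overall accuracy is not very high, but the predictions can still serve as a further reference for biologists. In summary, this study can help biologists and medical experts screen massive NGS short-read data and greatly narrow the range of candidate viral sequences. This improves the efficiency of identifying and confirming viruses, pathogenic viruses in particular, and provides strong technical support for the treatment and prevention of major epidemic infectious diseases.

Alcohol dependence (AD) is a typical form of chronic alcoholism: a particular psychological state toward alcohol brought on by long-term, repeated drinking. Over the 20 years from 1990 to 2010, alcohol use rose rapidly from sixth to third place among all global disease risk factors, behind only high blood pressure and secondhand smoke. Excessive drinking causes not only health-related harm but also social harm, such as traffic accidents, crime, child abuse, domestic violence, and other forms of injury. Drinking-related problems have therefore become one of the major public health issues worldwide, including in China. Although the incidence of alcohol dependence continues to rise, its exact etiology and pathogenesis are still not fully understood. Current research regards AD as a complex psychiatric disorder involving genetic, environmental, and other factors, and many studies have confirmed that it is closely related to genetic factors. Researchers in many countries have studied the association between neuropeptide Y (NPY) gene polymorphisms and alcohol dependence in different populations for more than a decade. Yet for the two main single nucleotide polymorphisms (SNPs), rs16139 and rs16147, the findings are inconsistent and sometimes completely opposite, so the susceptibility genes for AD have not been settled. The reason is that genetic background and environmental influences differ between populations and ethnic groups, so the allele and genotype frequencies of the same gene may differ between them, and so may the gene's effect on the same disease. Using existing case-control studies to find susceptibility genes for alcohol dependence, to screen high-risk groups at the genetic level, and to offer them targeted early intervention, diagnosis, and personalized treatment has important clinical value and social benefit.

Given these inconsistent findings, Chapter 4 asks whether there is a significant association between NPY gene polymorphisms and alcohol dependence. We use SNP meta-analysis to quantitatively analyze and jointly evaluate the published epidemiological literature on NPY gene polymorphisms, especially the two important SNPs rs16139 and rs16147, and the risk of AD. Following the basic requirements of SNP meta-analysis strictly, we collected high-quality studies published in China and abroad and carried out a quantitative synthesis of the literature on this association. We first tested the control groups for Hardy-Weinberg equilibrium (HWE) and then tested the heterogeneity among studies. Once these tests were passed, a strategy for selecting the best genetic model, based on logistic regression, led us to adopt the dominant genetic model for pooling the p-values of the individual studies, and we carried out subgroup analyses. Finally, funnel plots, Egger's linear regression, and Begg's rank correlation test ruled out publication bias. The results show that for most populations there is currently insufficient evidence of a significant association between NPY gene polymorphisms and alcohol dependence. The subgroup analysis, however, found that SNP rs16139 is associated with alcohol dependence in certain populations, such as Finns. By combining multiple existing results, the meta-analysis in this chapter increases the effective sample size and the statistical power. Especially when individual studies disagree or none reaches statistical significance, meta-analysis can give a pooled result closer to the truth, providing a scientific basis for clinicians and researchers to better understand the pathogenesis of alcohol dependence and its genetic diagnosis and treatment.

Chapter 5 summarizes the four sub-projects, examines in depth the shortcomings of each study and their causes, and proposes improvements for future research.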
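The three-base periodicity step described for Chapter 2 can be illustrated with a small sketch. It is not the thesis's implementation: the Code13 mapping and the Marple algorithm are not specified in this abstract, so the sketch swaps in a plain indicator sequence for one base and a direct solution of the Yule-Walker equations. The function names, the AR order of 6, the normalisation by the sample variance, and the synthetic sequences are all assumptions made for illustration.

```python
import numpy as np

def indicator(seq, base):
    """Binary indicator sequence for one base (a stand-in for Code13 mapping)."""
    return np.array([1.0 if ch == base else 0.0 for ch in seq])

def yule_walker(x, order):
    """Fit x[t] = sum_k a[k] * x[t-k] + e[t] by solving the Yule-Walker equations."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased sample autocovariance for lags 0..order
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], r[1:])   # Toeplitz system R a = r[1..order]
    sigma2 = r[0] - np.dot(a, r[1:])      # innovation variance
    return a, sigma2

def ar_psd_at(a, sigma2, theta):
    """AR power spectral density evaluated at angular frequency theta."""
    k = np.arange(1, len(a) + 1)
    return sigma2 / np.abs(1.0 - np.sum(a * np.exp(-1j * theta * k))) ** 2

def tbp_score(seq, base="G", order=6):
    """PSD at theta = 2*pi/3 divided by the sample variance (assumed SNR-like score)."""
    x = indicator(seq, base)
    a, sigma2 = yule_walker(x, order)
    return ar_psd_at(a, sigma2, 2 * np.pi / 3) / x.var()

def synthetic_seq(n_codons, g_probs, rng):
    """Hypothetical test data: per-codon-position probability of emitting 'G'."""
    out = []
    for _ in range(n_codons):
        for p in g_probs:
            out.append("G" if rng.random() < p else str(rng.choice(list("ACT"))))
    return "".join(out)

rng = np.random.default_rng(0)
coding = synthetic_seq(500, (0.7, 0.1, 0.1), rng)        # strong period-3 bias
noncoding = synthetic_seq(500, (0.25, 0.25, 0.25), rng)  # no positional bias
coding_score = tbp_score(coding)
noncoding_score = tbp_score(noncoding)
print(coding_score, noncoding_score)
```

In a full pipeline this score would be computed in a sliding window along the sequence to produce the SNR curve, which the abstract says is then denoised with a wavelet packet transform and thresholded; those steps are omitted here.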
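The Hardy-Weinberg equilibrium check applied to the control groups in Chapter 4 can be sketched as a chi-square goodness-of-fit test with one degree of freedom. This is a minimal illustration, not the thesis's code: the function name and the genotype counts are hypothetical, and the p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic SNP.

    Takes genotype counts and returns (chi2, p_value) with 1 degree of freedom.
    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele a
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical control-group genotype counts, for illustration only
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)    # exactly at equilibrium
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)   # no heterozygotes at all
print(chi2_ok, p_ok)
print(chi2_bad, p_bad)
```

A study whose control group gives a small p-value here departs from equilibrium; how such studies are handled before pooling is a design decision that this abstract does not spell out.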
[Degree-granting institution]: Shandong University
[Degree level]: Doctorate
[Year degree conferred]: 2016
[Classification number]: R318;TN911.7
Article ID: 2061430
Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2061430.html