Research on Genome-Wide Association Algorithms for Complex Diseases Based on the Maximal Information Coefficient

Published: 2018-04-17 10:48

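Topics: genome-wide association study + maximal information coefficient; source: doctoral dissertation, University of Electronic Science and Technology of China, 2015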

[Abstract]: Genome-wide association studies (GWAS) are programs, launched after the completion of the Human Genome Project, that sequence and scan the whole genome for complex diseases. They aim to discover disease-related genetic variants and single nucleotide polymorphisms (SNPs), to identify disease susceptibility regions and associated genes, and to find disease markers, in order to enable early diagnosis, effective personalized treatment, new drug development, and targeted prevention. Such studies are multi-center, large-sample, repeatedly validated gene-disease association studies carried out at the whole-genome level, and they seek to reveal comprehensively the genes involved in disease onset, progression, and treatment. Many promising algorithms and dedicated software tools have been developed for this purpose. Although existing algorithms have been validated as useful computational and statistical tools, studies have pointed out that their performance on general data remains quite uncertain. Moreover, because genome-wide data are very large and discrete, existing algorithms are unsatisfactory in running time, statistical power, and false positive rate. Developing new GWAS algorithms therefore remains an ongoing task for bioinformatics researchers. To this end, the thesis carries out the following work:
(1) The maximal information coefficient (MIC) is analyzed and studied. MIC is a novel statistic that satisfies equitability and generality well in the analysis of associated variables, and it clearly outperforms the common Pearson coefficient, Spearman coefficient, mutual information, CorGC, and the maximal correlation coefficient; the thesis therefore introduces it into GWAS. The thesis discusses the principle of MIC mathematically, proves an important recurrence relation for it, describes the implementation steps of the MIC algorithm in detail, and finally analyzes both the shortcomings of applying MIC directly to genotype data in GWAS and the feasibility of MIC-based GWAS.
(2) MICSNPs, a MIC-based disease-SNP association search algorithm, is proposed. MICSNPs uses a Monte Carlo permutation test to map MIC values to P values, which removes the effect of fluctuations in the MIC value, and combines this with a sliding-window binary search to save running time (about 0.58% of the time of a linear search). To find the best trade-off among statistical power, false positive rate, and running time, the thesis also studies how the number of Monte Carlo samples relates to these three measures, and finds that the optimal number of Monte Carlo samples is 2 to 4 times the number of biomarkers, independent of sample size. Tests on real GWAS data and simulated data show that, with the number of Monte Carlo samples reduced to 4 times the number of markers and with the sliding-window binary search in use, MICSNPs is feasible and effective both computationally and statistically, and its overall performance is better than that of existing algorithms.
(3) mBoMIC, a MIC-based disease-SNP association search algorithm, is proposed. First, the thesis modifies the traditional Bagging algorithm to obtain mBagging (modified Bagging). Its central idea is to make the in-bag and out-of-bag bootstrap sample sizes, which are equal in traditional Bagging, differ from each other, with the in-bag size required to be smaller than the out-of-bag size. Because the smaller in-bag sample preserves the best statistical power while lowering computational complexity, and the larger out-of-bag sample raises statistical power further, mBagging increases statistical power while reducing running time. In addition, the smaller in-bag sample mitigates the "overfitting" of traditional Bagging, so mBagging has a lower false positive rate than traditional Bagging. The main contribution of mBagging is that it substantially improves three originally conflicting measures at once: statistical power, false positive rate, and running time. Next, mBagging is used to combine MIC estimates, yielding the new disease-SNP association search algorithm mBoMIC. mBoMIC combines the advantages of MIC and mBagging, overcomes the low statistical power of MIC, and avoids the fluctuation of MIC values. On 500 data sets, mBoMIC with in-bag and out-of-bag sample sizes of 20 and 400, respectively, is compared with traditional Bagging with a sample size of 400: the average running time of mBoMIC is 80.3% lower, its average statistical power is 15.2% higher, and its average false positive rate is 31.3% lower. Finally, tests of mBoMIC on simulated and real data show that it has better statistical power than existing algorithms and is a feasible algorithm for biomarker selection.
(4) A MIC-based algorithm for identifying disease-related differentially expressed genes and microRNAs is constructed. GWAS algorithms can be used not only to explore genotype data but also to analyze gene and microRNA expression data. The thesis uses MIC to build a gene/microRNA expression profile analysis algorithm that mines disease-associated genes and microRNAs from genome-wide microarray expression data. The new algorithm is applied to an atrial fibrillation versus control gene expression data set and a valvular heart disease versus control microRNA expression data set. It identifies 41 genes differentially expressed in atrial fibrillation, 14 of which are new differentially expressed genes not reported in previous work. Signaling pathway and enrichment analyses show that these differentially expressed genes are highly relevant to atrial fibrillation. It also finds 2 strongly differentially expressed microRNAs, of which hsa-miR-221* is a new differentially expressed microRNA not reported in previous work.
The thesis successfully introduces MIC into GWAS, overcomes the shortcomings of MIC, and establishes several effective algorithms, including MICSNPs, mBoMIC, and microarray gene/microRNA expression profile analysis, providing important computational tools for searching for and identifying biomarkers associated with complex diseases in genome-wide data.
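The abstract's idea of mapping an association score to a P value with a Monte Carlo permutation test can be illustrated with a minimal sketch. This is not the thesis's MICSNPs implementation: a plug-in mutual information estimate stands in for MIC, the sliding-window binary search is omitted, and the function names, the toy genotype/phenotype data, and the choice of 200 permutations are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences.

    Stand-in association statistic (assumption): the thesis uses MIC,
    which is not implemented here.
    """
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def permutation_p_value(genotypes, phenotypes, statistic, n_perm, rng):
    """Estimate a P value by shuffling phenotype labels n_perm times."""
    observed = statistic(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one estimator, so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 100 controls (0) followed by 100 cases (1).
rng = random.Random(0)
phenotypes = [0] * 100 + [1] * 100
# A SNP whose minor-allele count (0/1/2) is higher in cases.
associated = ([rng.choice([0, 0, 1]) for _ in range(100)]
              + [rng.choice([1, 2, 2]) for _ in range(100)])
# A SNP generated independently of case/control status.
null_snp = [rng.choice([0, 1, 2]) for _ in range(200)]

p_assoc = permutation_p_value(associated, phenotypes, mutual_information, 200, rng)
p_null = permutation_p_value(null_snp, phenotypes, mutual_information, 200, rng)
print("associated SNP p-value:", p_assoc)
print("null SNP p-value:", p_null)
```

Because the P value is computed relative to the permutation distribution of the statistic itself, it does not depend on the statistic's raw scale, which is one reason a permutation test suits a score whose values fluctuate. The number of permutations here is arbitrary; the abstract reports an optimum of 2 to 4 times the number of markers for MICSNPs.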

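[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Doctoral
[Year degree granted]: 2015
[Classification number]: R3416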

[Similar literature]