Research on Support Vector Machine-Based Methods for Identifying Gene-Gene Interactions

Published: 2018-09-09 15:27

[Abstract]: Genome-Wide Association Studies (GWAS) usually use single nucleotide polymorphisms (SNPs) as markers to analyze genetic susceptibility to complex diseases. However, genetic data are characterized by small sample sizes, high dimensionality, and heavy noise, and studying interactions between genes through traditional experiments is time-consuming, laborious, and costly. Accurately analyzing gene-gene interactions with data mining techniques is therefore of great significance for exploring the etiology of complex diseases and for finding susceptibility genes. This thesis takes gene-gene interactions as its research object and, addressing the shortcomings of current identification methods, proposes a new interaction identification algorithm based on data mining techniques. The specific research content is as follows:
(1) A support vector machine-based gene interaction identification algorithm, SVMITER, is proposed. To avoid the excessive false positives caused by the multiple comparisons problem, the thesis combines support vector machine theory with the idea of the Cartesian product algorithm to propose SVMITER, an iterative attribute-combination algorithm based on support vector machines. The algorithm first uses a support vector machine to perform an initial screening of the SNPs, then combines the attributes of the screened SNPs using the Cartesian product algorithm, and then builds a support vector machine model on the resulting SNP combinations. The F-Measure value is used to judge whether that model is the best model. If it is not, the Cartesian product algorithm is applied again to form higher-order combinations of SNP attributes, and so on until the best prediction model is obtained. In the experiments on simulated data, kernel function selection and parameter tuning were performed first. SVMITER was then compared with three data mining algorithms, BOOST, Random Forest, and MDR, using Precision, Recall, and F-Measure as evaluation metrics, and SVMITER was found to perform best.
(2) Identification of low-order gene-gene interactions based on SVMITER. The thesis uses SVMITER to identify low-order gene interactions on both simulated and real data sets. Comparison with existing methods and analysis of two real cases show that, on simulated data, SVMITER achieves a higher identification performance (POWER) value than BOOST, and that on real data it accurately identifies SNP combinations such as rs380390 and rs1329428.
(3) Identification of high-order gene-gene interactions based on SVMITER. Building on the low-order work, the thesis further applies SVMITER to the high-order gene-gene interaction problem, again using both simulated and real data sets. Comparison with existing methods and case analysis show that SVMITER can identify 5th-order SNP combinations in high-order simulated data, that its POWER value remains higher than that of BOOST, and that on real data it accurately identifies a previously discovered 5th-order SNP combination.
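The combination-and-evaluation loop described in item (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the genotype coding (0, 1, 2), the reading of "Cartesian product combination" as pairing each sample's genotypes into a tuple, and the stopping rule (stop when a higher order no longer improves the F-Measure) are all assumptions made for illustration. The initial SVM screening step is omitted, and the classifier is passed in as a hypothetical `fit_predict` callable; the `majority_lookup` stand-in used here is deliberately not an SVM.

```python
from collections import Counter, defaultdict
from itertools import combinations


def combine_snps(genotypes, order):
    """Map every `order`-way SNP index tuple to one combined value per sample.

    genotypes: list of samples, each a list of genotype codes per SNP.
    A combined value is the tuple of the selected genotypes, i.e. one element
    of the Cartesian product of the individual genotype sets (assumed reading).
    """
    n_snps = len(genotypes[0])
    return {
        idx: [tuple(sample[i] for i in idx) for sample in genotypes]
        for idx in combinations(range(n_snps), order)
    }


def f_measure(y_true, y_pred, positive=1):
    """Return (precision, recall, F-Measure) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def majority_lookup(column, labels):
    """Stand-in classifier (NOT an SVM): predict, in-sample, the most common
    label seen for each combined genotype value."""
    votes = defaultdict(Counter)
    for value, label in zip(column, labels):
        votes[value][label] += 1
    return [votes[value].most_common(1)[0][0] for value in column]


def search_orders(genotypes, labels, fit_predict, max_order=3):
    """Try 2-way, 3-way, ... SNP combinations and keep the best F-Measure.

    Assumed stopping rule: stop as soon as a higher order fails to improve
    on the best F-Measure found so far.
    """
    best_f, best_idx = 0.0, None
    for order in range(2, max_order + 1):
        improved = False
        for idx, column in combine_snps(genotypes, order).items():
            f = f_measure(labels, fit_predict(column, labels))[2]
            if f > best_f:
                best_f, best_idx, improved = f, idx, True
        if not improved:
            break
    return best_f, best_idx
```

To get closer to the approach the abstract describes, `fit_predict` would be replaced by a function that trains and applies a support vector machine on an encoded version of the combined attribute, ideally scored on held-out data rather than in-sample as in this toy stand-in.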
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4;TP311.13
