Research on Genotype-Phenotype Association Techniques Based on In-Memory Computing

Published: 2018-02-28 23:28

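Keywords: disease phenotype; disease-causing gene; prioritization; TrustRank; big data  Source: Harbin Institute of Technology, 2017 master's thesis  Document type: degree thesis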

[Abstract]: As biomedical data grow explosively, the fast-developing field of bioinformatics is working to uncover the information hidden in these data, and related research has become a hot topic. Identifying disease-causing genes is a fundamental challenge in human health research, and it requires understanding the associations between genotypes and disease phenotypes through biological networks. Massive amounts of biological data are stored in a variety of databases that lack a unified standard; biological networks are built from these data, and studying them is itself a way of exploring complex life processes. Associations between disease phenotypes and genotypes are of far-reaching significance both for predicting disease genes and for finding the diseases that a gene causes. The modularity of disease indicates that functionally related proteins lead to similar diseases. Consequently, most disease-gene association methods are network-based computational approaches that integrate a protein-protein interaction network, a disease phenotype similarity network, and a disease-gene bipartite network. Online Mendelian Inheritance in Man (OMIM) is a database of human genetic diseases and their related genes. From OMIM data we compute a disease phenotype similarity network and a disease-gene association network, and combine them with a protein-protein interaction network to construct a complex heterogeneous network. This thesis reviews the related random walk with restart algorithms and derives the YSearch method by improving the web page ranking algorithm TrustRank. The algorithm first selects prior knowledge (a seed set) for the query disease (or gene) from the constructed network, then iterates a random walk strategy over the global network to obtain TR scores, and finally ranks candidate genes and diseases by priority to make predictions. Leave-one-out cross-validation is used to evaluate the algorithm, and ROC curves are used to compare its results with those of other methods, demonstrating its good performance. On this basis, we design and develop YSearch, a search engine platform for genes and diseases. The whole system is built on Spark, an in-memory big data computing platform, with data stored in HBase; the system and its optimization are also described. Both the algorithm and the platform can offer new ideas for clinical research such as disease diagnosis and treatment.
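The seed-based scoring described in the abstract can be illustrated with a small sketch. This is the standard TrustRank / random-walk-with-restart iteration, not the thesis's improved YSearch variant, whose details the abstract does not give. The toy network, the node names (`D1`, `G1`, ...), the function name `trust_rank`, and the `damping` value are all hypothetical choices made for illustration.

```python
def trust_rank(graph, seeds, damping=0.85, max_iter=200, tol=1e-10):
    """Seeded random walk with restart on a weighted graph.

    graph: dict mapping every node to a dict {neighbor: non-negative weight}.
    seeds: set of trusted nodes (the prior knowledge for the query).
    Returns a dict node -> score; scores sum to 1.
    """
    nodes = sorted(graph)
    # Restart distribution: uniform over the seed set, zero elsewhere.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(max_iter):
        # Every node first receives its share of the restart mass.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for u in nodes:
            total = sum(graph[u].values())
            if total == 0:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += damping * scores[u] * restart[n]
                continue
            # Spread the remaining mass along weighted outgoing edges.
            for v, w in graph[u].items():
                nxt[v] += damping * scores[u] * w / total
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores


# Hypothetical toy heterogeneous network (all edges undirected, weight 1):
# D1-D2 phenotype similarity, D1-G1 and D2-G2 disease-gene links,
# G1-G2 and G2-G3 protein interactions.
graph = {
    "D1": {"D2": 1.0, "G1": 1.0},
    "D2": {"D1": 1.0, "G2": 1.0},
    "G1": {"D1": 1.0, "G2": 1.0},
    "G2": {"D2": 1.0, "G1": 1.0, "G3": 1.0},
    "G3": {"G2": 1.0},
}

scores = trust_rank(graph, seeds={"D1"})
# Rank candidate genes for the query disease D1 by descending score.
ranked = sorted((n for n in graph if n.startswith("G")),
                key=lambda n: scores[n], reverse=True)
print(ranked)
```

In this sketch the query disease itself is the only seed; a real seed set would also draw on known associations, and a leave-one-out evaluation would hide one known disease-gene link at a time and check where the hidden gene lands in the ranking.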
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: R3416;TP311.13

[References]