Application and Study of Topic Models in Computing Gene Semantic Similarity

Published: 2018-01-20 13:40

Keywords: Gene Ontology semantic similarity, LDA, BTM, topic model  Source: East China Normal University, 2017 master's thesis  Document type: degree thesis

[Abstract]: In recent years, when biologists discover an unknown gene, they typically compare it with known genes and infer its properties from the similarity between the two. Alignment algorithms are used to compare gene sequences or structures and so to find genes that are functionally similar or related. Studies show, however, that genes that are functionally similar or related are not necessarily highly similar in sequence. To address this problem, the current mainstream approach analyzes and predicts the properties of unknown genes by computing the semantic similarity between the Gene Ontology terms with which the genes are annotated. These methods, however, rely only on the relationships between terms in the Gene Ontology to reflect gene semantic similarity indirectly, and do not consider the intrinsic semantic content of the terms themselves. This thesis proposes a gene semantic similarity algorithm based on topic models that mines intrinsic semantic information from the text representing each term, which addresses the shortcomings of traditional methods to some extent. The thesis makes three main contributions: 1. When computing the similarity of a term pair, latent semantic information is mined from the terms annotated to the genes, and the text representing each term's semantics is converted into a high-dimensional topic vector, so that the similarity between terms becomes the similarity between the high-dimensional topic vectors that represent them. 2. Two models, SSGTLDA and SSGTBTM, are proposed. For long term texts obtained through the Google search engine, the SSGTLDA model models the document-topic and topic-word relationships to obtain a high-dimensional topic vector for each term text. For short term texts obtained from Gene Ontology definitions, the SSGTBTM model models the word pairs across the whole term corpus to obtain a high-dimensional topic vector for each term text. 3. Both gene semantic similarity methods, SSGTLDA and SSGTBTM, are implemented and evaluated on two kinds of data sets: term pairs and protein pairs. The experimental results show that both proposed algorithms perform well.
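The two ideas in the abstract, modeling word pairs in short term definitions and comparing terms through their topic vectors, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `biterms` and `cosine_similarity` are hypothetical, the topic vectors are made-up numbers rather than the output of a trained LDA or BTM model, and cosine similarity is an assumption, since the abstract does not say which vector similarity measure is used.

```python
import math
from itertools import combinations


def biterms(tokens):
    """Return every unordered word pair (biterm) in one short text.

    BTM-style models learn topics from such pairs pooled over the whole
    corpus instead of from per-document word counts.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def cosine_similarity(u, v):
    """Cosine similarity between two topic vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("topic vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical topic distributions for three terms (illustrative values only).
term_a = [0.7, 0.2, 0.1]
term_b = [0.6, 0.3, 0.1]
term_c = [0.05, 0.15, 0.8]

# Tokens of a made-up short definition text.
definition_tokens = ["protein", "binding", "activity"]

if __name__ == "__main__":
    print(biterms(definition_tokens))
    print(cosine_similarity(term_a, term_b))
    print(cosine_similarity(term_a, term_c))
```

In this sketch, terms whose topic distributions concentrate on the same topics (`term_a` and `term_b`) score higher than terms whose distributions concentrate on different topics (`term_a` and `term_c`).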
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1

[References]