
Research on Methods for Extracting Protein-Protein Interaction Relations Based on Biomedical Text Mining

Published: 2018-10-15 15:43
[Abstract]: In recent years, as the biomedical literature has grown rapidly, using data mining techniques to obtain biomedical knowledge from that literature has become a research hotspot in bioinformatics. Protein-protein interaction (PPI) is one of the most fundamental and important ways in which proteins carry out their biological functions, yet much PPI information is recorded in the biomedical literature as unstructured data, and finding it by manual review is time-consuming and laborious. Using text mining to mine and analyze the protein interaction relations in biomedical literature, and thereby extract PPI relations accurately, is therefore of great significance. Existing work treats PPI relation extraction from biomedical literature as a binary classification problem and mostly uses statistical and machine learning algorithms: features are extracted from the biological text to form feature vectors, a classification model is built on them, and good extraction results have been achieved. However, the machine learning methods used are usually supervised and need large amounts of labeled PPI relation data to build the classification model, and in the biomedical domain manually annotating a PPI relation corpus costs considerable labor and time. To reduce the amount of labeled data needed to build a classification model, this thesis studies the following two aspects:

1. Extracting protein-protein interaction relations based on distant supervision and transfer learning. The PPI relation dataset to be classified is treated as the target-domain dataset. To reduce the need for labeled data in target-domain PPI relation extraction, this study uses transfer learning: knowledge is transferred from a source-domain PPI relation dataset with a different distribution to build the relation extraction model, which then classifies the unclassified target-domain PPI samples. Following the idea of distant supervision, the study builds an artificially labeled corpus as the source-domain PPI dataset. PPI data are first downloaded from the IntAct protein interaction database to serve as the relation knowledge base, and biomedical abstracts are crawled from the PubMed database to serve as the raw corpus. The PPI pairs in the knowledge base are mapped onto the raw corpus, and heuristic matching retrieves the sentences that contain each pair. PPIs in the raw corpus for which a mapping exists are taken as positive samples and the others as negative samples, which yields an artificially labeled PPI dataset. The instance-based transfer learning method TrAdaBoost is then used to build a classification model on the constructed source-domain PPI dataset together with part of the target PPI dataset, and to classify the target-domain PPI samples. Experiments on three standard datasets show that the artificial dataset built by distant supervision helps the algorithm build the classification model, and that, when few labeled target-domain samples are available, transferring knowledge from the artificial dataset gives good performance on target-domain PPI relation extraction.

2. Extracting protein-protein interactions based on transfer learning and distant supervision in the PU (Positive Unlabeled) scenario. In practical applications, data are often unlabeled or only sparsely labeled, as with the PPI datasets in this study. Because of experimental constraints, for many existing PPI relations it cannot be determined whether an interaction exists, so these data can be regarded as an unlabeled dataset. Only a small number of PPI relations have been experimentally verified to interact, and these can be regarded as positive samples. In this situation, traditional supervised algorithms cannot build an effective classification model to identify PPI relations in the biological literature. Building on distant supervision, this study approaches the problem from both transfer learning and PU learning and proposes the TPAODE algorithm, a method for PPI relation extraction based on transfer learning and distant supervision in the PU scenario. The method collects feature information from the target PPI dataset, uses a data gravitation method to assign weights to the source PPI samples for knowledge transfer, estimates probability parameters on the weighted source PPI dataset based on Bayesian theory, and builds a weight-based PU learning algorithm using static classifier ensemble techniques. Experimental results show that TPAODE requires no class labels on the target-domain PPI dataset and only a subset of the interacting samples labeled in the source-domain PPI dataset; with a classification model built from the source-domain and target-domain PPI datasets, it performs comparably to or better than traditional PU methods. To reduce the labeling requirement further, the study uses the artificial PPI dataset built earlier by distant supervision as the source-domain dataset, learns a model from the source dataset (which has only a few positive samples) and the target dataset, and classifies the target-domain PPI samples. The results show that, even with the distant-supervision dataset, TPAODE achieves better classification performance than the existing PU learning methods PNB and PTAN.
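The distant-supervision labeling step in aspect 1 can be illustrated with a minimal Python sketch. It assumes that exact whole-word name matching stands in for the heuristic matching, whose actual rules the abstract does not specify. The function name `label_sentences`, the data layout, and the placeholder protein names are hypothetical and are not taken from the thesis.

```python
import re
from itertools import combinations


def label_sentences(sentences, protein_names, kb_pairs):
    """Label co-occurring protein pairs by lookup in a knowledge base.

    Assumption: a pair found in the knowledge base is a positive sample (1),
    any other co-occurring pair is a negative sample (0).
    """
    # Store pairs as frozensets so that (A, B) and (B, A) are the same pair.
    kb = {frozenset(pair) for pair in kb_pairs}
    # Whole-word patterns; a stand-in for the unspecified matching heuristics.
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b")
        for name in protein_names
    }
    samples = []
    for sentence in sentences:
        mentioned = sorted(
            name for name, pattern in patterns.items() if pattern.search(sentence)
        )
        # Every pair of proteins mentioned in one sentence is a candidate.
        for a, b in combinations(mentioned, 2):
            label = 1 if frozenset((a, b)) in kb else 0
            samples.append((sentence, a, b, label))
    return samples


# Toy data with placeholder names (not real IntAct or PubMed records).
sentences = [
    "ProtA binds ProtB in vitro.",
    "ProtA and ProtC were both detected.",
    "ProtB was purified.",
]
names = ["ProtA", "ProtB", "ProtC"]
kb_pairs = [("ProtA", "ProtB")]

for sample in label_sentences(sentences, names, kb_pairs):
    print(sample)
```

Labels produced this way are noisy: a sentence can mention two interacting proteins without asserting the interaction, which is one reason the source-domain dataset is treated as having a different distribution from the target data.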
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q51; TP391.1




Article ID: 2273008


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2273008.html

