Research on Methods for Extracting Protein-Protein Interaction Relations Based on Biomedical Text Mining
[Abstract]: In recent years, as the volume of biomedical literature has grown rapidly, using data mining to acquire biomedical knowledge from that literature has become a hot topic in bioinformatics. Proteins carry out their most fundamental biological functions through protein-protein interactions (PPIs), yet a large amount of PPI information is recorded in the biomedical literature as unstructured data, and finding it by manual review is very time consuming. It is therefore important to extract PPI relations accurately by applying text mining techniques to the biomedical literature. Research on PPI relation extraction usually treats the task as a binary classification problem: statistical and machine learning algorithms extract features from biological text to form feature vectors, a classification model is built on those vectors, and good extraction performance has been obtained. However, the machine learning methods used in current research are usually supervised and need a large amount of labeled PPI data to build the classification model, and in biomedicine labeling such data costs a great deal of manpower and time.
To reduce the amount of labeled data needed to build the classification model, this thesis studies the following two aspects. 1. Protein interaction extraction based on distant supervision and transfer learning. The PPI data set to be classified is regarded as the target-domain data set. To reduce the demand for labeled data in the target domain, this study uses transfer learning to build a relation extraction model by transferring knowledge from source-domain PPI data sets with different distributions, and then classifies the PPI samples in the target domain. Following the idea of distant supervision, an artificially labeled corpus is constructed as the source-domain PPI data set: PPI data are first downloaded from the IntAct protein interaction database as the relation knowledge base, and biomedical abstracts are crawled from the PubMed database as the raw corpus. The PPI pairs in the knowledge base are then mapped onto the raw corpus. Sentences containing protein pairs are obtained by heuristic matching; pairs that map to a PPI in the knowledge base are taken as positive samples and the remaining pairs as negative samples, which yields the artificially labeled PPI data set. The instance-based transfer learning method TrAdaBoost then builds a classification model from the constructed source-domain PPI data set and part of the target-domain PPI data set, and classifies the PPI samples in the target domain.
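The distant-supervision labeling step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `label_sentences`, the input layout (sentences already paired with their recognized protein mentions), and the toy knowledge base and sentences are all assumptions made for the example, and the heuristic matching mentioned in the abstract is reduced to simple co-occurrence in one sentence.

```python
from itertools import combinations


def label_sentences(sentences, known_pairs):
    """Label every protein pair that co-occurs in a sentence.

    sentences   : list of (text, [protein names found in the text])
    known_pairs : set of frozensets, each holding one interacting pair
                  taken from the knowledge base (hypothetical layout)
    Returns a list of (text, protein_a, protein_b, label) tuples,
    where label is 1 if the pair is in the knowledge base, else 0.
    """
    samples = []
    for text, proteins in sentences:
        # Sort so the pair order is deterministic; a frozenset makes
        # the knowledge-base lookup independent of that order.
        for a, b in combinations(sorted(set(proteins)), 2):
            label = 1 if frozenset((a, b)) in known_pairs else 0
            samples.append((text, a, b, label))
    return samples


# Toy knowledge base and corpus, for illustration only.
known_pairs = {frozenset(("BRCA1", "BARD1"))}
sentences = [
    ("BRCA1 forms a heterodimer with BARD1.", ["BRCA1", "BARD1"]),
    ("TP53 and BRCA1 were both measured.", ["TP53", "BRCA1"]),
]
samples = label_sentences(sentences, known_pairs)
for sample in samples:
    print(sample)
```

Labels produced this way are noisy: a sentence may mention two interacting proteins without describing the interaction, and a pair missing from the knowledge base may still interact. That noise is one reason to combine such a corpus with transfer learning rather than train on it directly.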
The experimental results on three standard data sets show that a classification model built from the artificially labeled data set constructed by distant supervision can classify PPI samples even when few labeled samples are available in the target domain.
2. Protein interaction extraction based on transfer learning and distant supervision in the PU (Positive Unlabeled) scenario. In many practical applications data are unlabeled or only a small amount is labeled, as is the case for the PPI data sets in this study. Because of limits on experimental conditions, for most candidate protein pairs it has not been determined whether they interact, so these data can be treated as an unlabeled set; only a small number of PPIs have been experimentally verified, and these can be treated as positive samples. In this case, traditional supervised algorithms cannot build effective classification models to identify PPI relations in the biological literature. Building on distant supervision, this thesis combines transfer learning and PU learning and proposes a method for extracting protein interaction relations based on transfer learning and distant supervision in the PU scenario. The method collects feature information from the target PPI data set, transfers knowledge into the weights of the source PPI samples with the data gravitation method, estimates probability parameters on the weighted source PPI data set based on Bayesian theory, and constructs a weight-based PU learning algorithm using static classifier ensemble techniques. The experimental results show that the TPAODE algorithm proposed in this study needs no class labels for the target-domain PPI data set and needs only the interacting samples to be labeled in the source-domain PPI data set; the classification model built from the source-domain and target-domain PPI data sets has performance comparable to or better than conventional PU methods. To further reduce the model's need for labeled data, the study uses the artificially labeled PPI data set constructed by distant supervision as the source-domain data set, learns a model with only a few positive samples from the source and target data sets, and classifies the PPI samples in the target domain. The results show that TPAODE still has better classification performance than the existing PU learning methods PNB and PTAN.
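The abstract does not give the formulas behind the data attraction (gravitation) weighting or the weighted Bayesian estimates, so the following is only a rough sketch of the general idea: source samples that lie closer to the target data receive larger weights, and probabilities are then estimated from weighted counts. Everything specific here is an assumption made for illustration: the function names, the use of the target centroid as a single mass, the `1 + distance²` denominator, and the Laplace smoothing. It is not the TPAODE algorithm itself.

```python
def gravitation_weights(source, target):
    """Weight each source vector by its attraction to the target data.

    Assumption: the target set acts as one particle located at its
    centroid with mass len(target); the weight is mass / (1 + d^2),
    where the +1 keeps the weight finite when d is zero.
    """
    dim = len(target[0])
    centroid = [sum(x[j] for x in target) / len(target) for j in range(dim)]
    mass = len(target)
    weights = []
    for x in source:
        dist_sq = sum((x[j] - centroid[j]) ** 2 for j in range(dim))
        weights.append(mass / (1.0 + dist_sq))
    return weights


def weighted_feature_prob(samples, weights, feature, value, n_values=2):
    """Laplace-smoothed weighted estimate of P(x[feature] == value)."""
    matched = sum(w for x, w in zip(samples, weights) if x[feature] == value)
    return (matched + 1.0) / (sum(weights) + n_values)


# Toy binary feature vectors, for illustration only.
target = [[1, 1], [1, 0]]
source = [[1, 1], [0, 0]]

weights = gravitation_weights(source, target)
prob = weighted_feature_prob(source, weights, feature=0, value=1)
print(weights, prob)
```

With uniform weights the smoothed estimate for feature 0 in this toy data would be 0.5; weighting pulls it toward the source sample that resembles the target data. A PU learner in this style would apply such weighted estimates separately to the labeled positive samples and to the unlabeled samples.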
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q51; TP391.1