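Research on Extracting Relations between Protein Phosphorylation and Disease
Published: 2018-01-11 04:03
Keywords: Research on Extracting Relations between Protein Phosphorylation and Disease Source: University of Science and Technology of China, 2017 master's thesis Document type: degree thesis
Related topics: bioinformatics; disease named entity recognition; medical terminology; semantic dictionary; conditional random field; protein phosphorylation; relation extraction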
[Abstract]: Protein phosphorylation is one of the most important post-translational modifications of proteins in living organisms. A large number of human diseases have been shown to be caused by abnormal phosphorylation, and some disease-associated phosphorylation modifications can be developed into molecular markers or therapeutic targets for those diseases. With the explosive growth of the biomedical literature, automatically extracting relations between protein phosphorylation and disease from that literature has become a research focus in related fields. The task has two parts: disease named entity recognition, and determining the relation between protein phosphorylation and disease. Machine learning is currently the mainstream approach to disease named entity recognition, but machine learning methods have difficulty recognizing the medical terms inside disease named entities, and no publicly available system currently exists for extracting relations between protein phosphorylation and disease.
This thesis studies the extraction of relations between protein phosphorylation and disease. Its work and contributions are as follows.
First, it presents a disease named entity recognition method that combines a conditional random field (CRF) with a semantic dictionary. A dictionary of medical terms carrying semantic information is built from web resources, which addresses the difficulty of recognizing medical terms within disease named entities. The dictionary is first used to obtain semantic information for medical terms; the CRF then combines this information with lexical and part-of-speech features and with orthographic and domain features to recognize disease named entities; finally, abbreviation recognition is adjusted to improve recognition performance. Experiments on the NCBI Disease Corpus show that the method improves the F-score by about 2.5% over DNorm, and experiments on an open dataset show that it has some advantage in recognizing longer disease entities.
Second, relations between protein phosphorylation and disease are divided into four types: Absence, Presence, Down-regulation, and Up-regulation. The thesis implements PDRMine, a system for extracting these relations, which works in three steps. It first uses RLIMS-P, a rule-based protein phosphorylation information extraction system, to extract phosphorylation information from the literature. It then applies the disease named entity recognition method designed in this thesis to identify disease named entities in the sentences that contain phosphorylation information. Finally, it uses a rule-based method to determine the type of relation between protein phosphorylation and disease. Recognizing trigger words is the main difficulty of the last step; the thesis enlarges the trigger-word set through synonym expansion, which improves the determination of relation types. On an open dataset, the system achieves a precision of 72.6% and a recall of 66.4%.
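The last step described in the abstract (rule-based relation typing with a trigger-word set enlarged by synonym expansion) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis or from PDRMine: the seed trigger words, the synonym table, the function names, the priority order of the checks, and the choice to fall back to Presence when no trigger is found.

```python
import re

# Hypothetical seed trigger words per relation type (not from the thesis).
SEED_TRIGGERS = {
    "Absence": {"absent", "lack"},
    "Down-regulation": {"decreased", "reduced"},
    "Up-regulation": {"increased", "elevated"},
}

# Hypothetical synonym table standing in for a thesaurus lookup.
SYNONYMS = {
    "absent": {"undetectable"},
    "decreased": {"diminished", "lowered"},
    "increased": {"enhanced", "augmented"},
}


def expand_triggers(seeds, synonyms):
    """Return a new trigger map in which each seed word is joined by its synonyms."""
    expanded = {}
    for relation, words in seeds.items():
        full = set(words)
        for word in words:
            full |= synonyms.get(word, set())
        expanded[relation] = full
    return expanded


def classify_relation(sentence, triggers):
    """Check relation types in a fixed priority order.

    Falls back to "Presence" when no trigger word is found (an assumption
    of this sketch, not a rule stated in the thesis).
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for relation in ("Absence", "Down-regulation", "Up-regulation"):
        if any(token in triggers[relation] for token in tokens):
            return relation
    return "Presence"


if __name__ == "__main__":
    # Made-up example sentence, used only to exercise the functions.
    sentence = "Tau phosphorylation was enhanced in Alzheimer's disease."
    print(classify_relation(sentence, SEED_TRIGGERS))  # Presence
    print(classify_relation(sentence, expand_triggers(SEED_TRIGGERS, SYNONYMS)))  # Up-regulation
```

In this toy example the seed set alone misses "enhanced", so the sentence falls through to the default type; after expansion the same sentence is typed as Up-regulation. This is the kind of gain from a larger trigger set that the abstract reports, though the actual rules and trigger lists in the thesis are not given here.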
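[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1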
Article ID: 1408014
Article link: https://www.wllwen.com/shoufeilunwen/xixikjs/1408014.html