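Biomedical Relation Extraction Based on Word Representation and Deep Learning
Published: 2018-06-24 09:02
Topics: word representation + deep learning; Source: Dalian University of Technology, 2016 doctoral dissertation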
[Abstract]: Protein relation extraction and drug relation extraction are of great significance for building biomedical databases, for life science research, for drug development, and for disease prevention and treatment. Much current research on biomedical relation extraction focuses on selecting feature sets and designing kernel functions. After more than a decade of development, feature-based and kernel-based methods have become relatively mature, and the room for further improvement is limited. To improve performance further, this thesis studies extraction methods based on word representation and deep learning. Deep learning can build deeper relation extraction models to improve extraction performance, while word representation, which encodes semantic information into word vectors, is a prerequisite for deep learning.
The main contributions are as follows. First, a word representation model is designed for the characteristics of biomedical text. Building on a traditional word representation model, it incorporates five kinds of important information (word form, part of speech, stem, syntactic chunk, and biomedical named entity) to strengthen the semantic representation capability of the word vectors. It achieves good results on tasks such as protein relation extraction and drug relation extraction, which verifies the effectiveness of incorporating rich information such as part of speech and entities into word representations and provides a sound word representation basis for deep-learning-based relation extraction.
Second, for binary protein relation extraction, an extraction model based on instance representation is proposed to overcome the reliance of traditional methods on features and kernel functions. The model comprises three parts: word vectors, skeleton features, and feature combination. It takes the characteristics of protein relation instances into account, uses word vectors as input, and combines them with skeleton features and vector composition, thereby fusing rich semantic information into the instance representation. On larger corpora its extraction performance reaches the current state of the art, which verifies the effectiveness of word representation and deep learning methods for protein relation extraction.
Third, for multi-class drug relation extraction, a two-stage method is proposed. In the first stage, instance representation is combined with syntactic features, and a logistic regression classifier identifies the positive drug relation instances. In the second stage, a long short-term memory (LSTM) network classifies the positive instances into four drug relation types. To improve second-stage performance, the influence of various factors on the LSTM network is examined in terms of importance, implementation cost, and computational cost. Experiments show that word vectors, distance vectors, part-of-speech vectors, and a two-layer bidirectional LSTM network all improve second-stage classification, and they are important factors in the good results of the two-stage drug relation extraction method.
In summary, for binary relation extraction between proteins and multi-class relation extraction between drugs, this thesis proposes extraction methods that use representation and deep learning techniques. These methods overcome, to some extent, the limitations of feature-based and kernel-based methods and achieve good results. Word representation and deep learning have been research hotspots in recent years but were adopted relatively late in biomedical text mining. The proposed methods achieve some results on biomedical relation extraction tasks, which verifies their effectiveness and shows that methods based on word representation and deep learning leave broad room for research in biomedical text mining and merit further exploration in future work.
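The two-stage control flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the feature vectors are toy numbers rather than instance representations with syntactic features, stage 1 is a small hand-written logistic regression, `toy_type_classifier` is a hypothetical stand-in for the stage-2 LSTM, and the labels `type_1` to `type_4` are placeholders because the abstract does not name the four relation types.

```python
import math
from typing import Callable, List, Sequence

NEGATIVE = "none"  # label for drug pairs rejected in stage 1
# Placeholder names: the abstract does not name the four relation types.
RELATION_TYPES = ["type_1", "type_2", "type_3", "type_4"]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lr: float = 0.5,
    epochs: int = 500,
) -> List[float]:
    """Fit weights by batch gradient descent; the last entry is the bias."""
    dim = len(xs[0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def toy_type_classifier(x: Sequence[float]) -> str:
    """Hypothetical stand-in for the stage-2 LSTM: buckets one feature."""
    idx = min(3, max(0, int((x[1] + 1.0) * 2)))
    return RELATION_TYPES[idx]


def two_stage_predict(
    w: Sequence[float],
    type_classifier: Callable[[Sequence[float]], str],
    x: Sequence[float],
    threshold: float = 0.5,
) -> str:
    """Stage 1 filters out negatives; stage 2 types the remaining positives."""
    score = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
    if score < threshold:
        return NEGATIVE
    return type_classifier(x)


if __name__ == "__main__":
    # Toy training set: label 1 marks a positive (interacting) pair.
    train_x = [[1.0, 0.2], [0.8, -0.3], [-1.0, 0.1], [-0.7, -0.2]]
    train_y = [1, 1, 0, 0]
    weights = train_logistic_regression(train_x, train_y)
    for candidate in ([0.9, 0.0], [-0.9, 0.0]):
        print(candidate, two_stage_predict(weights, toy_type_classifier, candidate))
```

The point of the cascade is that the stage-2 classifier only ever sees instances that stage 1 accepted, so it can be trained on the four positive types alone. The stage-1 threshold then controls how many candidate pairs reach stage 2.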
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP391.1
Article ID: 2060941
Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2060941.html