Research on Drug Information Extraction Methods for Biomedical Text

Published: 2019-06-10 07:11

[Abstract]: With advances in biomedical research and Internet technology, the volume of biomedical literature available online has grown rapidly. This mass of unstructured literature contains rich and valuable knowledge. Drugs, as a widely studied class of biomedical entity, are an important carrier of that knowledge. Extracting structured drug information from unstructured biomedical text can serve researchers and medical professionals in related fields, and can also expand and update existing drug knowledge bases. Drug information extraction from biomedical text has therefore attracted increasing attention and has become an active research topic. Current research concentrates on two problems, drug name recognition and drug-drug interaction extraction, and the performance of existing methods does not yet meet the needs of practical applications. This thesis therefore studies these two problems in depth. The main contributions are as follows. First, a drug name recognition method based on multi-semantic feature fusion. Semantic features derived from drug name dictionaries are very helpful for recognizing drug names and are widely used in machine-learning-based drug name recognition. However, because drug name dictionaries have limited coverage and are not updated promptly, dictionary-based semantic features have inherent limitations. This thesis observes that large-scale unstructured biomedical literature contains many drug names that are absent from the dictionaries. To compensate for the shortcomings of dictionary-based semantic features, this thesis proposes a drug name recognition method based on multi-semantic feature fusion.
The method uses large-scale unstructured biomedical literature to generate semantic features based on word vectors, and uses them jointly with the semantic features generated from drug name dictionaries for drug name recognition. Experimental results show that the multi-semantic feature fusion method outperforms methods that use a single type of semantic feature. Second, a drug name recognition method based on feature combination and feature selection. Feature combination means combining several simple features of different types into one combined feature. Compared with a simple feature, a combined feature has the advantage that it can represent multiple attributes of a word in a sentence. In drug name recognition there are many possible ways to combine features; combining simple features directly produces a very large number of combined features that contain a great deal of noise, which degrades model performance. For this reason, apart from n-gram features, existing drug name recognition methods generally use only simple features. To make effective use of combined features, this thesis proposes a feature generation framework for drug name recognition. The framework consists of a feature combination module and a feature selection module: the feature combination module combines simple features into combined features, and the feature selection module removes the large amount of noise from the feature set. Using this framework, word vector features, dictionary features, and general features are combined, and the resulting features are used in a conditional random field (CRF) model for drug name recognition.
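The two modules of the feature generation framework can be illustrated with a minimal sketch. The abstract does not say how features are combined or which selection criterion is used, so everything concrete here is an assumption made for illustration: the three simple features (a suffix, a dictionary flag, a word-vector cluster ID), the pairwise conjunction in `combine`, and the frequency threshold in `select`, which stands in for whatever selection method the thesis actually applies.

```python
from collections import Counter
from itertools import combinations

def simple_features(token, in_dictionary, cluster_id):
    """Simple features for one token (all three feature types are illustrative)."""
    return {
        "general:suffix3": token[-3:].lower(),  # general orthographic feature
        "dict:in_lexicon": str(in_dictionary),  # dictionary feature
        "emb:cluster": str(cluster_id),         # word-vector feature, discretized as a cluster ID
    }

def combine(features):
    """Feature combination: conjoin every pair of simple features."""
    combined = {}
    for (n1, v1), (n2, v2) in combinations(sorted(features.items()), 2):
        combined[f"{n1}&{n2}"] = f"{v1}|{v2}"
    return combined

def select(feature_dicts, min_count=2):
    """Feature selection (assumed criterion): keep (name, value) pairs seen at least min_count times."""
    counts = Counter(item for feats in feature_dicts for item in feats.items())
    keep = {item for item, count in counts.items() if count >= min_count}
    return [{n: v for n, v in feats.items() if (n, v) in keep} for feats in feature_dicts]

# Toy input: (token, found in drug dictionary?, hypothetical word-vector cluster ID).
tokens = [("aspirin", True, 7), ("heparin", True, 7), ("protein", False, 3)]
all_feats = []
for tok, in_dict, cluster in tokens:
    feats = simple_features(tok, in_dict, cluster)
    feats.update(combine(feats))  # simple features plus their pairwise combinations
    all_feats.append(feats)
selected = select(all_feats, min_count=2)
```

In this toy example the rare feature values attached to "protein" are filtered out as noise, while the features shared by the two drug names survive. The surviving feature dictionaries are the kind of per-token input a CRF toolkit typically accepts.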
Experimental results show that the method based on feature combination and feature selection outperforms drug name recognition methods that use only simple features. Third, a drug-drug interaction extraction method based on a text-sequence convolutional neural network (CNN). The best-performing existing drug-drug interaction extraction methods are based on support vector machines (SVMs). These methods use a large number of hand-crafted features and need various external natural language processing (NLP) tools to generate them, so their performance depends heavily on those tools. To reduce the dependence on external NLP tools, this thesis proposes a drug-drug interaction extraction method based on a text-sequence convolutional neural network. The method takes as input only word vectors learned by an unsupervised deep learning algorithm and randomly initialized position vectors; features are learned automatically through convolution over the text sequence followed by max pooling, and are fed to a softmax classifier for relation extraction. Experimental results show that this method outperforms traditional SVM-based methods. Fourth, a drug-drug interaction extraction method based on a dependency-structure convolutional neural network. The text-sequence CNN method ignores long-distance dependency relations between words, which are important for drug-drug interaction extraction.
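The forward pass of the text-sequence CNN (convolution over consecutive tokens, max pooling, then a softmax classifier) can be sketched in plain Python. This is an untrained illustration with random weights, not the thesis's model. The dimensions, the window size, the `tanh` activation, and the use of two position vectors per token (one per drug mention) are all assumptions; the abstract states only that word vectors and randomly initialized position vectors are the inputs.

```python
import math
import random

def conv_max_pool(embeddings, filters, window=3):
    """Apply each filter to every window of consecutive token vectors, then max-pool over positions.

    Assumes the sentence has at least `window` tokens.
    """
    pooled = []
    for filt in filters:
        scores = []
        for start in range(len(embeddings) - window + 1):
            # Concatenate the vectors of `window` consecutive tokens.
            flat = [x for vec in embeddings[start:start + window] for x in vec]
            scores.append(math.tanh(sum(w * x for w, x in zip(filt, flat))))
        pooled.append(max(scores))  # max pooling: one value per filter
    return pooled

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
word_dim, pos_dim, n_filters, n_classes, window = 4, 2, 5, 3, 3
sentence_len = 6
token_dim = word_dim + 2 * pos_dim  # word vector + two position vectors (assumed layout)

# Random stand-ins for the token representations and the model parameters.
tokens = [[random.uniform(-1, 1) for _ in range(token_dim)] for _ in range(sentence_len)]
filters = [[random.uniform(-1, 1) for _ in range(window * token_dim)] for _ in range(n_filters)]
output_weights = [[random.uniform(-1, 1) for _ in range(n_filters)] for _ in range(n_classes)]

features = conv_max_pool(tokens, filters, window)
probs = softmax([sum(w * f for w, f in zip(row, features)) for row in output_weights])
```

Because max pooling keeps one value per filter, `features` has a fixed length regardless of sentence length, which is what lets a single softmax layer classify sentences of any size.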
This thesis therefore proposes a drug-drug interaction extraction method based on a dependency-structure convolutional neural network (CNN), which incorporates the long-distance dependency relations between words into the CNN model. Experimental results show that introducing these long-distance dependency relations improves drug-drug interaction extraction performance. However, syntactic parsers make many errors in the dependency parses of long sentences; these errors propagate into the dependency-structure CNN model and hurt its performance. To avoid this error propagation, this thesis combines the text-sequence and dependency-structure CNN methods according to sentence length. Experimental results show that this combination further improves drug-drug interaction extraction performance.
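One plausible reading of the length-based combination is a simple router that picks a model per sentence. The sketch below assumes that sentences up to a length threshold go to the dependency-based model and longer ones go to the text-sequence model, since the abstract attributes parser errors to long sentences. The threshold value, the function name `make_length_router`, and the stand-in models are hypothetical; the abstract does not specify how the combination is implemented.

```python
from typing import Callable, List

Classifier = Callable[[List[str]], str]

def make_length_router(sequence_model: Classifier,
                       dependency_model: Classifier,
                       max_parse_length: int) -> Classifier:
    """Use the dependency-based model for short sentences and the sequence model otherwise."""
    def classify(tokens: List[str]) -> str:
        if len(tokens) <= max_parse_length:
            return dependency_model(tokens)  # short sentence: parse is assumed reliable
        return sequence_model(tokens)        # long sentence: avoid propagating parse errors
    return classify

# Stand-in models that only report which branch handled the sentence.
route = make_length_router(
    sequence_model=lambda tokens: "sequence-cnn",
    dependency_model=lambda tokens: "dependency-cnn",
    max_parse_length=5,  # hypothetical threshold; in practice it would be tuned on held-out data
)
short = "aspirin may potentiate warfarin".split()
long_ = "concomitant use of aspirin with warfarin may increase the risk of bleeding".split()
```

Here `route(short)` is handled by the dependency-based stand-in and `route(long_)` by the sequence-based one.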
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.1

[Similar literature]