Research on Biomedical Named Entity Recognition Based on the Hadoop Platform and Hidden Markov Models

Published: 2018-03-30 04:08

Topic: biomedical named entity recognition　Focus: hidden Markov model　Source: Northwest A&F University, 2017 master's thesis

[Abstract]: Biomedicine is an interdisciplinary field whose body of specialized knowledge has grown steadily in recent years, and the volume of related text has grown with it. This mass of text contains a great deal of valuable information and data, and the goal of current big-data-based biomedical text mining is to extract that useful information from massive data for researchers to use. Biomedical named entity recognition is a key step in biomedical text mining. Because traditional centralized biomedical named entity recognition methods have difficulty handling massive text data, this study uses distributed computing on the Hadoop platform to train named entity recognition models and to process large-scale data. The research consists of two main parts.

(1) Parameter training of the HMM model on the Hadoop platform. By counting, over the training corpus, the distribution of initial states, the number of transitions between states, and the distribution of observations emitted by each state, the three parameters of the HMM model are obtained: the initial state probability distribution, the state transition probability matrix, and the symbol emission probability matrix. To evaluate the parameter training efficiency and named entity recognition performance of the HMM model on the Hadoop platform, a CRF model is used for comparison. The gradient vector of the feature function weights in the CRF model is computed in parallel on the Hadoop platform, and the optimal model parameters are computed iteratively. The comparison on the Hadoop platform shows that, with the same training data, the CRF model's recognition performance is slightly higher than the HMM model's, but as the data volume grows, the HMM model trains far more efficiently on the Hadoop platform than the CRF model. This thesis selects the HMM model for named entity recognition on large-scale biomedical text on the Hadoop platform.

(2) Biomedical named entity recognition with the HMM model on the Hadoop platform. This operation is divided into two MapReduce jobs. In the first, the test data are cleaned to remove useless information that introduces noise, yielding new test data. In the second, the Map phase performs sentence splitting, tokenization, and part-of-speech tagging, and sends the sentences with part-of-speech tags as output to the Reduce phase; the Reduce phase calls the Viterbi algorithm, using the HMM parameters trained in (1), to label the named entities in each sentence, and finally outputs sentences with biomedical named entity tags. Experimental results on the Hadoop platform show that, for large-scale biomedical text, named entity recognition on the Hadoop platform is far more efficient than single-machine processing and can save a large amount of processing time.
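The two steps described in the abstract, count-based HMM parameter estimation and Viterbi decoding, can be illustrated with a small single-machine sketch. This is not the thesis's implementation: it only imitates the map and reduce phases with plain Python functions, and the toy corpus, the tag names (`B-PROTEIN`, `B-DNA`, `O`), the add-one smoothing, and the `<unk>` token for unseen words are all assumptions made for illustration.

```python
from collections import Counter
from math import log

# Hypothetical toy corpus: each sentence is a list of (token, tag) pairs.
CORPUS = [
    [("p53", "B-PROTEIN"), ("binds", "O"), ("DNA", "B-DNA")],
    [("the", "O"), ("p53", "B-PROTEIN"), ("gene", "O")],
    [("DNA", "B-DNA"), ("binds", "O"), ("p53", "B-PROTEIN")],
]

def map_counts(sentence):
    """Map step: emit (key, 1) for each initial, transition and emission event."""
    prev = None
    for token, tag in sentence:
        if prev is None:
            yield ("init", tag), 1
        else:
            yield ("trans", prev, tag), 1
        yield ("emit", tag, token), 1
        prev = tag

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def estimate(counts, alpha=1.0):
    """Turn counts into log-probabilities (add-alpha smoothing is an assumption)."""
    tags = sorted({k[1] for k in counts if k[0] == "emit"})
    vocab = sorted({k[2] for k in counts if k[0] == "emit"})
    words = vocab + ["<unk>"]
    n_sent = sum(counts[("init", t)] for t in tags)
    pi = {t: log((counts[("init", t)] + alpha) / (n_sent + alpha * len(tags)))
          for t in tags}
    A, B = {}, {}
    for s in tags:
        out = sum(counts[("trans", s, t)] for t in tags)
        emitted = sum(counts[("emit", s, w)] for w in vocab)
        for t in tags:
            A[s, t] = log((counts[("trans", s, t)] + alpha)
                          / (out + alpha * len(tags)))
        for w in words:
            B[s, w] = log((counts[("emit", s, w)] + alpha)
                          / (emitted + alpha * len(words)))
    return tags, vocab, pi, A, B

def viterbi(tokens, tags, vocab, pi, A, B):
    """Return the most probable tag sequence for a non-empty token list."""
    known = set(vocab)
    obs = [w if w in known else "<unk>" for w in tokens]
    score = [{t: pi[t] + B[t, obs[0]] for t in tags}]
    back = []
    for w in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for t in tags:
            best = max(tags, key=lambda s: prev[s] + A[s, t])
            cur[t] = prev[best] + A[best, t] + B[t, w]
            ptr[t] = best
        score.append(cur)
        back.append(ptr)
    path = [max(tags, key=lambda t: score[-1][t])]
    for ptr in reversed(back):  # follow back-pointers from the last token
        path.append(ptr[path[-1]])
    return path[::-1]

counts = reduce_counts(kv for sent in CORPUS for kv in map_counts(sent))
tags, vocab, pi, A, B = estimate(counts)
print(viterbi(["p53", "binds", "DNA"], tags, vocab, pi, A, B))
```

Because the map step only emits independent counts and the reduce step only sums them, the counting stage needs a single pass over the corpus, whereas the CRF training described in the abstract computes gradients iteratively.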
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: R318

[References]