
Chinese Named Entity Recognition for the Medical Domain

Published: 2018-10-29 12:00
[Abstract]: With the explosive growth of text data in recent years and the construction and spread of large-scale knowledge bases, named entity recognition has become a major research focus in natural language processing. Traditional methods based on supervised learning, however, require large annotated corpora, and in the medical domain, where annotated data is scarce, they fall short of the desired performance. With the rapid development and adoption of deep learning, recurrent neural networks (RNN, Recurrent Neural Network), and in particular long short-term memory units (LSTM, Long Short-Term Memory), have been widely applied in natural language processing and have achieved results significantly better than traditional methods in several research directions. We therefore first apply an LSTM model to named entity recognition in the medical domain and show that, both in evaluation and in practical application, it outperforms the traditional conditional random field model (CRF, Conditional Random Fields). Because well-formed annotated corpora in the medical domain are relatively scarce, we further want the LSTM model, which already outperforms the CRF model, to incorporate external information, learning both linguistic features from the news domain and unsupervised semantic information from the medical domain. Drawing on transfer learning and pre-training in deep learning, we perform parameter fusion and model tuning for the medical-domain model, which further improves its performance. Finally, because of the shortcomings of the LSTM model in practical application, we explore another approach to domain-adaptive named entity recognition. To identify the differences between knowledge domains, we run several groups of comparative experiments that mix corpora from different domains. We then use a GBDT model to integrate these domain differences with unsupervised medical-domain semantic vectors for named entity recognition, and obtain good results.
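As a rough illustration of how an LSTM tagger and a CRF tagger can be compared on the same footing, the sketch below scores predicted tag sequences at the entity level. The abstract does not state the tagging scheme, label set, or metric used in the thesis; the character-level BIO scheme, the labels `DIS` and `SYM`, and the helper names `bio_to_spans` and `entity_f1` are all assumptions made for this example.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into entity spans (hypothetical helper)."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the label matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient handling: a stray I- tag opens a new span.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 between two BIO tag sequences."""
    gold_spans = set(bio_to_spans(gold))
    pred_spans = set(bio_to_spans(pred))
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence, one tag per Chinese character (labels are assumed).
chars = list("高血压患者咳嗽")
gold = ["B-DIS", "I-DIS", "I-DIS", "O", "O", "B-SYM", "I-SYM"]
pred = ["B-DIS", "I-DIS", "I-DIS", "O", "O", "B-SYM", "O"]

gold_entities = ["".join(chars[s:e]) for s, e, _ in bio_to_spans(gold)]
score = entity_f1(gold, pred)
```

Here the prediction recovers the first entity exactly but truncates the second, so one of two gold entities is matched. Exact-match scoring of this kind penalizes boundary errors, which matters for Chinese text because entity boundaries are not marked by whitespace.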
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1;TP18


