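Research on Chinese Named Entity Recognition Algorithms

Published: 2018-03-05 08:55

Topic: Chinese named entity recognition　Focus: hybrid model　Source: Zhejiang University, 2017 master's thesis　Document type: degree thesis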

[Abstract]: Named entity recognition (NER) is the task of identifying entities with specific meaning in text, mainly person names, place names, and organization names. It is an important technique for converting unstructured data into structured data, a key step toward correct machine understanding of text, and a foundational task for many natural language processing applications such as information extraction, sentiment analysis, and question answering, so research on NER is of considerable significance. Owing to the characteristics of the Chinese language, however, Chinese NER still faces many difficulties, chiefly: (1) Chinese NER is usually performed with a single model, and each such model has its own strengths, weaknesses, and limitations. (2) Chinese NER usually operates on word sequences and therefore relies on Chinese word segmentation, so its performance often depends on segmentation accuracy. The research content and main contributions of this thesis are as follows. (1) It surveys related work on NER in China and abroad, summarizes and implements the mainstream NER methods, and analyzes and compares their strengths and weaknesses, which informs the subsequent work. (2) To address the limitations of a single model, it combines multiple models and applies multi-task learning to Chinese NER. The resulting method, BiLSTM-CRF-MTL, addresses the shortcomings of single models reasonably well and requires little feature engineering, because the model learns features through several related tasks. (3) To address the problems of word-sequence-based recognition, it performs Chinese NER on character sequences, introduces word vectors based on an external corpus and new-word discovery, and uses a word segmentation confidence based on keyword extraction as a feature to reduce the noise introduced by segmentation. (4) To help the model fit context better and to ease the shortage of labeled samples, it proposes a sample generation method based on entity-word replacement. Chinese NER is evaluated on the 1998 People's Daily corpus, comparing the proposed method against several single-model methods and methods from the related literature. The experimental results show that the proposed method achieves an average F1 of 88.79%, a considerable improvement over the other methods.
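The abstract does not give the details of the entity-word replacement method, so the following is only a minimal sketch of one plausible reading: each entity span in a character-level BIO-tagged sentence is swapped for another entity of the same type, and the tags are regenerated. The function names, the PER/LOC tag names, and the toy lexicon are all assumptions made for illustration and do not come from the thesis.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (character, BIO tag), e.g. ("杭", "B-LOC")


def extract_spans(tagged: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, entity_type) for each entity span.

    Assumes well-formed BIO tags: every span opens with B-X and continues
    with I-X of the same type.
    """
    spans = []
    start, etype = None, None
    for i, (_, tag) in enumerate(tagged):
        if not tag.startswith("I-") and start is not None:
            # An O or B- tag closes the span that was open.
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tagged), etype))
    return spans


def replace_entities(
    tagged: List[Token],
    lexicon: Dict[str, List[str]],  # hypothetical: entity type -> known entities
    rng: random.Random,
) -> List[Token]:
    """Build a new training sample by swapping each entity for another one
    of the same type, keeping the surrounding context unchanged."""
    out: List[Token] = []
    pos = 0
    for start, end, etype in extract_spans(tagged):
        out.extend(tagged[pos:start])  # copy context before the entity
        original = "".join(ch for ch, _ in tagged[start:end])
        candidates = [e for e in lexicon.get(etype, []) if e != original]
        new_entity = rng.choice(candidates) if candidates else original
        # Re-tag the replacement character by character.
        for k, ch in enumerate(new_entity):
            out.append((ch, ("B-" if k == 0 else "I-") + etype))
        pos = end
    out.extend(tagged[pos:])  # copy context after the last entity
    return out


sentence = [
    ("王", "B-PER"), ("小", "I-PER"), ("明", "I-PER"), ("在", "O"),
    ("杭", "B-LOC"), ("州", "I-LOC"), ("工", "O"), ("作", "O"),
]
toy_lexicon = {"PER": ["王小明", "李华"], "LOC": ["杭州", "北京"]}
augmented = replace_entities(sentence, toy_lexicon, random.Random(0))
```

Because the replacement is drawn from entities of the same type, the generated sentence keeps a valid tag sequence while exposing the model to the same context with a different entity. Whether the thesis samples replacements uniformly, filters them, or draws them from the training set is not stated in the abstract.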
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1
