Research on Chinese Personal Name Recognition Based on Recurrent Neural Networks

Published: 2018-05-20 10:19

Topics: Chinese personal name recognition + word vectors; Source: master's thesis, Dalian University of Technology, 2016

[Abstract]: Chinese personal name recognition is a basic task in Chinese information processing, and its performance directly affects the performance of other tasks. Because Chinese personal names are formed with great freedom, they account for a large share of out-of-vocabulary words, so solving out-of-vocabulary word recognition requires first solving personal name recognition. Solving Chinese personal name recognition is therefore of considerable importance. Existing statistical methods for Chinese personal name recognition suffer from complex feature selection and a need for manual intervention. To address these problems, this thesis proposes a Chinese personal name recognition method based on recurrent neural networks (RNNs). The method uses only word vectors as model features and requires no manual intervention, which effectively reduces the complexity of feature selection and the influence of manual intervention on the experiments. In addition, word vectors can be trained on large amounts of unlabeled Chinese text; feeding these semantically rich vectors into the recurrent neural network lets the model learn more information and improves its performance. The thesis divides the approach into two stages: a model construction stage and a post-processing stage. In the model construction stage, the focus is on strategies for optimizing the word vectors. Three strategies are proposed: (1) replacing the randomly initialized word vectors of the recurrent neural network with word vectors trained by word2vec; (2) generalizing numerals in the word vector training corpus; (3) modifying the word2vec model so that feature information is incorporated into the word vectors. Experimental results show that these word vector optimizations raise the F-score of the Chinese personal name recognition model by 2.23%.
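The abstract does not say how numerals in the training corpus are generalized, so the following is only a minimal sketch of one plausible reading: every token that is a pure numeral is mapped to a single placeholder before the word vectors are trained, so that many rare number tokens share one vector. The placeholder `<NUM>`, the set of characters treated as numerals, and the function names are all assumptions made for illustration, not details from the thesis.

```python
import re

# Assumption: numerals are written with ASCII digits, full-width digits,
# or Chinese numeral characters, optionally with a decimal part and a
# trailing percent sign. The thesis does not specify this set.
DIGITS = "0-9０-９零〇一二三四五六七八九十百千万亿两"
NUMERAL_RE = re.compile(rf"[{DIGITS}]+(?:[.．][{DIGITS}]+)?[%％]?")

NUM_TOKEN = "<NUM>"  # hypothetical placeholder token


def generalize_numerals(tokens):
    """Replace every token that is entirely a numeral with NUM_TOKEN."""
    return [NUM_TOKEN if NUMERAL_RE.fullmatch(tok) else tok for tok in tokens]


def generalize_corpus(lines):
    """Apply numeral generalization to whitespace-segmented corpus lines."""
    for line in lines:
        yield " ".join(generalize_numerals(line.split()))
```

Because the pattern must match the whole token, a name such as 张三 is left untouched even though it contains a numeral character. A single-character token such as 万, which can also be a surname, would still be replaced; a real pipeline would need extra handling for that case.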
In the post-processing stage, candidate names are filtered with contextual rules. A document-level global diffusion operation recalls names that go unrecognized at one position for lack of information but are recognized at other positions in the same document. A document-level local diffusion operation identifies names that appear in the document with only a given name or only a surname. Experimental results show that rule-based filtering and the diffusion operations raise the F-score of the Chinese personal name recognition model by 4.74%.
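The two diffusion operations can be illustrated with a small sketch. It assumes a document is a list of segmented sentences, that each name is a single token, and that the first-pass recognizer returns the set of `(sentence index, token index)` positions it tagged as names. The local variant covers only the given-name case and assumes a one-character surname. These representations and function names are illustrative assumptions; the abstract does not describe the thesis's actual procedure at this level of detail.

```python
def global_diffusion(sentences, recognized):
    """Tag every occurrence of a name that was recognized anywhere in the document.

    sentences:  list of token lists (one list per sentence)
    recognized: set of (sent_idx, tok_idx) positions tagged by the first pass
    """
    known_names = {sentences[i][j] for i, j in recognized}
    expanded = set(recognized)
    for i, sent in enumerate(sentences):
        for j, tok in enumerate(sent):
            if tok in known_names:
                expanded.add((i, j))
    return expanded


def local_diffusion(sentences, recognized, min_given_len=2):
    """Tag tokens that equal the given-name part of a recognized full name.

    Assumption: the surname is the first character of the full name.
    Given names shorter than min_given_len are skipped as too ambiguous.
    """
    given_names = set()
    for i, j in recognized:
        given = sentences[i][j][1:]
        if len(given) >= min_given_len:
            given_names.add(given)
    expanded = set(recognized)
    for i, sent in enumerate(sentences):
        for j, tok in enumerate(sent):
            if tok in given_names:
                expanded.add((i, j))
    return expanded


# Toy document: the first pass found the name only in the first sentence.
doc = [
    ["王小明", "教授", "发表", "讲话"],
    ["会后", "王小明", "离开"],
    ["小明", "表示", "感谢"],
]
first_pass = {(0, 0)}
after_global = global_diffusion(doc, first_pass)
after_local = local_diffusion(doc, after_global)
```

In this toy document, global diffusion adds the second occurrence of 王小明, and local diffusion then adds the bare given name 小明 in the last sentence. Surname-only mentions are left out of the sketch because a lone surname character is highly ambiguous without context.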
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[CLC number]: TP391.1;TP183

[References]