Chinese Organization Name Recognition Based on SVM and HMM Algorithms

Published: 2018-04-20 05:35

Topic: natural language processing + named entity recognition; Source: Jilin University, 2017 master's thesis

[Abstract]: Named entity recognition (NER) is an indispensable component of many natural language processing (NLP) technologies, including information extraction, information retrieval, machine translation, and online question answering systems. Chinese NER identifies named entities in Chinese natural-language text: person names, place names, organization names, and expressions of time, quantity, monetary value, and percentage. Compared with other Chinese named entities, organization names are structurally complex and varied in composition, which makes them one of the harder parts of Chinese NER. This thesis takes a primarily machine-learning approach, using a support vector machine (SVM) and a hidden Markov model (HMM), supplemented by rule-based methods, to recognize Chinese organization names. Based on how these names are formed, each name is divided into two parts: the organization-name suffix word and the organization-name prefix words. First, all organization-name suffix words are extracted by hand to form a feature dictionary. Then, whenever a word from the feature dictionary appears in the text, the system decides whether it is actually functioning as an organization-name suffix, that is, whether it marks the right boundary (the end) of an organization name. This decision is a binary classification problem, and the SVM has clear advantages on binary classification, so the thesis uses an SVM to determine the right boundary of each Chinese organization name.
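The suffix-boundary step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the suffix dictionary, the context-window features (`word=`, `prev=`, `next=`), the four training sentences, and the hyperparameters are all hypothetical, and the classifier is a small linear SVM trained by subgradient descent on the hinge loss rather than whatever SVM package and kernel the author used.

```python
from collections import defaultdict

# Hypothetical hand-built suffix dictionary (the thesis compiles its own).
SUFFIX_DICT = {"大学", "公司", "银行"}


def candidates(tokens):
    """Return (index, feature set) for each token found in the suffix dictionary."""
    found = []
    for i, tok in enumerate(tokens):
        if tok in SUFFIX_DICT:
            prev_tok = tokens[i - 1] if i > 0 else "<BOS>"
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
            found.append((i, {"word=" + tok, "prev=" + prev_tok, "next=" + next_tok}))
    return found


def train_linear_svm(samples, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss; labels are +1 / -1."""
    w, b = defaultdict(float), 0.0
    for _ in range(epochs):
        for feats, y in samples:
            margin = y * (sum(w[f] for f in feats) + b)
            for f in w:                  # L2 shrinkage on the weights (bias is not regularized)
                w[f] *= 1.0 - lr * lam
            if margin < 1.0:             # hinge loss is active: step toward the correct side
                for f in feats:
                    w[f] += lr * y
                b += lr * y
    return w, b


def is_suffix(w, b, feats):
    """True if the candidate is classified as a real organization-name suffix."""
    return sum(w.get(f, 0.0) for f in feats) + b > 0


# Hypothetical pre-segmented training sentences with labels for the dictionary word.
TRAIN = [
    (["吉林", "大学", "发布", "公告"], +1),  # "大学" closes an organization name
    (["他", "上", "大学", "时"], -1),        # "大学" used as a common noun
    (["北京", "大学", "的", "学生"], +1),
    (["她", "读", "大学", "期间"], -1),
]
samples = [(feats, y) for toks, y in TRAIN for _, feats in candidates(toks)]
w, b = train_linear_svm(samples)

for idx, feats in candidates(["毕业", "于", "吉林", "大学", "的"]):
    print(idx, is_suffix(w, b, feats))
```

The design point is that the dictionary only proposes candidates; the classifier, using the surrounding context, decides which occurrences actually end an organization name.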
Because the prefix words of Chinese organization names are structurally complex, varied in composition, and of uneven length, they are difficult to recognize. Once the suffix word of an organization name has been located in the text, however, recognizing the prefix words fits the HMM well, so the thesis proposes using an HMM to identify each prefix word, that is, to determine the left boundary (the start) of the organization name. Once both the left and right boundaries are determined, the organization name is recognized. Experiments show that combining the SVM and HMM models is effective and yields good recognition results. In the closed test, the best precision, recall, and F-value reach 96.29%, 88.70%, and 92.34%; in the open test, they reach 90.17%, 81.94%, and 85.61%.
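To make the prefix step concrete, here is a minimal Viterbi sketch of how an HMM could mark the left boundary once the suffix position is known. Everything specific in it is an assumption for illustration: the abstract does not give the thesis's state set, observations, or parameters, so the two states (`O` for outside, `P` for prefix word), the coarse token classes (`LOC`, `N`, `V`, `OTHER`), and all probabilities are hypothetical and hand-set rather than estimated from a corpus.

```python
import math

STATES = ("O", "P")  # O = outside the name, P = prefix word of the name

# All probabilities below are hypothetical, hand-set for illustration only.
START = {"O": 0.8, "P": 0.2}
TRANS = {"O": {"O": 0.7, "P": 0.3}, "P": {"O": 0.2, "P": 0.8}}
EMIT = {
    "O": {"LOC": 0.1, "N": 0.3, "V": 0.4, "OTHER": 0.2},
    "P": {"LOC": 0.5, "N": 0.4, "V": 0.05, "OTHER": 0.05},
}


def viterbi(obs):
    """Most likely state sequence for a non-empty list of observation classes."""
    score = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def left_boundary(tagged, suffix_index):
    """Index of the first token of the organization name ending at suffix_index."""
    if suffix_index == 0:
        return 0
    path = viterbi([cls for _, cls in tagged[:suffix_index]])
    start = suffix_index
    # Extend leftward over the run of P states adjacent to the suffix.
    while start > 0 and path[start - 1] == "P":
        start -= 1
    return start


# Hypothetical pre-segmented sentence; the suffix "银行" (index 4) is assumed
# to have been confirmed already by the suffix classifier.
sent = [("记者", "N"), ("从", "OTHER"), ("中国", "LOC"),
        ("农业", "N"), ("银行", "N"), ("获悉", "V")]
start = left_boundary(sent, 4)
print("".join(tok for tok, _ in sent[start:5]))  # 中国农业银行
```

Decoding only the tokens to the left of a confirmed suffix keeps the HMM's job narrow: it never has to decide where a name ends, only how far back it extends.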
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1

[Similar literature]