Research on Named Entity Recognition for Lao

Published: 2018-02-09 12:24

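  Keywords: organization name recognition; two-layer model; semi-supervised learning; conditional random field (CRF); disagreement; named entity recognition; support vector machine (SVM); Lao. Source: Kunming University of Science and Technology, master's thesis, 2017. Document type: degree thesis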


[Abstract]: Named entity recognition (NER) has been a fundamental task in natural language processing ever since the task was first proposed. For Lao, research on named entities is still weak. As political and economic exchanges between China and Laos grow closer, information processing for Lao is becoming important to the two countries' economic and cultural exchange, so research on Lao named entity recognition is necessary to keep pace with developments in the economic, political, and other spheres. This thesis addresses the features specific to Lao named entities and the current scarcity of Lao named entity corpora, and studies methods for recognizing Lao place names, person names, and organization names. The main results are as follows.

(1) Disagreement-based Lao named entity recognition. The main obstacle to Lao NER is that annotated corpora are scarce and slow to acquire. Very little research exists in China or abroad, and the corpus that can be obtained from online resources and from manual annotation by expert teachers and Lao students is far from sufficient. To address this, the thesis proposes a disagreement-based Lao NER algorithm. First, three supervised classifiers are trained on the labeled Lao named entity corpus; conditional random fields (CRFs) are used for training. The three classifiers are then each applied to the same unlabeled corpus, and a classification-weighted voting strategy assigns preliminary labels to the unlabeled samples. Next, the preliminarily labeled corpus is verified a second time. Finally, the new samples are added to the existing Lao corpus.

(2) Lao organization name recognition based on cascaded CRFs. The experiments above expanded part of the experimental corpus. Earlier work in the laboratory recognized Lao person names and place names with a single-layer CRF and with a method combining rules and statistics, and achieved good results in small-corpus experiments. No dedicated study of Lao organization names existed, however, and because Lao organization names contain many nested nouns, they are hard to recognize with a single-layer model. The thesis therefore proposes a Lao organization name recognition algorithm based on a cascaded CRF model that uses two CRF layers. The first layer recognizes simple Lao person names, place names, and organization names, and passes its results, combined with the observations, to the second-layer CRF. The second layer combines the first-layer results with a corresponding Lao feature template to recognize complex Lao organization names. Experiments show that the method recognizes Lao organization names well.

(3) Lao organization name recognition with a two-layer model of CRF and SVM. After analyzing in depth how Lao organization names are constructed, we found that most of them carry a boundary feature word, so recognizing these boundary feature words specifically should raise the recognition rate. The cascaded CRF method above does not handle this well. The thesis therefore proposes a second method that targets boundary recognition and combines CRF with SVM. In the first layer, simple Lao person names, place names, and organization names are recognized, and the results, combined with the observations, are passed to the second-layer model (the SVM). In the second layer, a driver-based method recognizes organization names by identifying their boundary features. Finally, a confidence calculation corrects the recognition results. Experiments show that targeting organization name boundaries clearly improves recognition accuracy for Lao organization names.
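
The weighted-voting step of result (1) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the three taggers are stand-in gazetteer functions rather than trained CRFs, the weights are arbitrary, and the agreement threshold `min_agreement` is an assumed form of the "second verification", which the abstract does not specify. All names (`weighted_vote`, `pseudo_label`, `gazetteer_tagger`) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

# A tagger maps a tokenized sentence to one label per token. In the thesis the
# three taggers are CRF models; here they are stand-in functions (assumption).
Tagger = Callable[[Sequence[str]], List[str]]


def weighted_vote(predictions: Sequence[Sequence[str]],
                  weights: Sequence[float]) -> Tuple[List[str], float]:
    """Return the voted label sequence and the lowest per-token agreement."""
    total = sum(weights)
    voted: List[str] = []
    min_share = 1.0
    for token_labels in zip(*predictions):
        scores = defaultdict(float)
        for label, weight in zip(token_labels, weights):
            scores[label] += weight
        best = max(scores, key=scores.get)  # ties go to the first label seen
        voted.append(best)
        min_share = min(min_share, scores[best] / total)
    return voted, min_share


def pseudo_label(taggers: Sequence[Tagger],
                 weights: Sequence[float],
                 unlabeled: Sequence[Sequence[str]],
                 min_agreement: float = 0.7):
    """Keep only sentences whose weakest token vote reaches min_agreement.

    The threshold is an assumed stand-in for the second verification pass.
    """
    accepted = []
    for sentence in unlabeled:
        predictions = [tagger(sentence) for tagger in taggers]
        labels, share = weighted_vote(predictions, weights)
        if share >= min_agreement:
            accepted.append((list(sentence), labels))
    return accepted


def gazetteer_tagger(places) -> Tagger:
    """Toy tagger: marks tokens found in a place-name list (illustration only)."""
    def tag(sentence: Sequence[str]) -> List[str]:
        return ["B-LOC" if token in places else "O" for token in sentence]
    return tag


taggers = [
    gazetteer_tagger({"Vientiane"}),
    gazetteer_tagger({"Vientiane", "Pakse"}),
    gazetteer_tagger({"Pakse"}),
]
weights = [0.5, 0.25, 0.25]
unlabeled = [["He", "visited", "Vientiane"], ["She", "left", "Pakse"]]

# Sentences that pass the vote would be appended to the labeled corpus.
new_samples = pseudo_label(taggers, weights, unlabeled)
```

In this toy run the first sentence is accepted because taggers holding 0.75 of the weight agree on every token, while the second is rejected because the vote on "Pakse" splits evenly. In practice the weights might come from each classifier's held-out accuracy, which is again an assumption rather than something the abstract states.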

Article ID: 1497922

Article link: https://www.wllwen.com/jingjilunwen/jiliangjingjilunwen/1497922.html

