
Automatic Recognition of Personal Names Based on a Mongolian Corpus

Published: 2018-06-25 14:22

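Topics: Mongolian information processing + Modern Mongolian annotation specification for corpora; Source: Minzu University of China (中央民族大学), doctoral dissertation, 2013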


[Abstract]: Automatic recognition of Mongolian personal names is one of the subtasks of named entity recognition.
Chinese and English information processing has developed over half a century and has made great progress in basic resource construction, part-of-speech tagging, information retrieval, text classification, machine translation, language recognition and synthesis, human-computer dialogue, and other fields. This modernization has also profoundly promoted the theory and technology of information processing for China's minority languages.
Compared with Chinese and English information processing, Mongolian information processing started somewhat later, but it has achieved outstanding results within minority-language information processing. It has largely completed the character- and word-processing stages and has now entered the sentence-processing stage: shallow parsing tasks such as phrase structure relation recognition and phrase boundary identification have been completed, work is moving toward deep syntactic analysis, and research on Mongolian information retrieval, automatic summarization, text classification, and machine translation is also growing.
Mongolian lexical analysis and tagging are of great significance for the study of phrases, syntax, semantics, and discourse. As a foundational step, however, lexical analysis and tagging have seen little progress on unknown (out-of-vocabulary) words, especially named entities. This weakness in named entity recognition has consistently limited the accuracy of lexical analysis, and in turn the development of phrase analysis, syntactic analysis, information retrieval, machine translation, and other areas.
Proper nouns are an important part of a corpus, and a breakthrough in proper noun recognition is an important foundation for improving the accuracy of Mongolian lexical analysis and for subsequent work. Ambiguity and unknown words are the two major obstacles to segmentation accuracy; unknown words include new words and named entities such as personal names and place names. This dissertation presents research on the automatic recognition of Mongolian personal names, covering both ordinary names and names that also belong to other word classes (multi-category names), and therefore has considerable academic and practical value.
Mongolian texts contain a large number of personal names, multi-category usage is fairly common, studies of Mongolian personal names are few, and there is little existing theory or technology to draw on. Mongolian personal name recognition therefore faces many difficulties, mainly:
☆ Personal names form an open set and cannot be enumerated exhaustively. Mongolian personal names are highly ambiguous across word classes: the more common a word is, the more commonly it serves as a name. Nouns, verbs, adjectives, numerals, time words, adverbs, pronouns, and imitative (mimetic) words can all serve as personal names, which makes name recognition very difficult.
☆ Deeply annotated Mongolian corpora are still small compared with those for Chinese and English, which inevitably affects the use of statistical methods. Inner Mongolia University currently holds a deeply annotated corpus of 2 million words, whereas we use a corpus of 260,000 words; this corpus size places some limits on rule extraction and machine learning.
☆ Proper noun recognition has long been a difficult problem in Mongolian lexical analysis and tagging. Personal names readily overlap with place names and other proper nouns, so ambiguity among proper nouns is another difficulty we face.
This dissertation uses the maximum entropy statistical method to recognize Mongolian personal names. Building on earlier, mainly rule-based research, it applies the maximum entropy model to Mongolian named entity recognition and implements an automatic recognition system for Mongolian personal names. Its main innovations and contributions are as follows:
◇ A corpus for Mongolian personal name recognition is built for the first time.
Mongolian corpora have now reached a certain scale, which has helped advance Mongolian information processing. To date, however, no corpus dedicated to Mongolian personal name recognition has been built in China or abroad. We collected 5,773 Mongolian sentences containing personal names from the web and used them together with the Inner Mongolia University corpus to train the recognition model and test the automatic recognition results, which effectively compensates for the shortage of corpus data.
◇ The internal and external structure of Mongolian personal names is studied systematically.
We analyze in depth the ethnic, historical-period, regional, and gender characteristics of Mongolian personal names, summarize their internal composition patterns, and discuss their structural types and features as well as the distinctively Mongolian surnames and their origins.
◇ Specifications for Mongolian corpus annotation and transcription are proposed.
Based on an analysis of the current state of Mongolian corpus annotation, we propose the "Modern Mongolian Annotation Specification for Corpora". To address the many problems in annotating Chinese personal names, and taking the established conventions for writing loanwords in Mongolian as the basis and the "Modern Mongolian Corpus Annotation Specification" (《现代蒙古语语料库标注规范》) as a reference, we also propose a detailed "Latin transcription scheme for Chinese personal names".
◇ A knowledge base for personal name recognition is built.
To recognize Mongolian personal names automatically, we built an ordinary-name knowledge base of dictionaries and mapping tables, including a Chinese surname dictionary, a Mongolian surname dictionary, a dictionary of common Mongolian personal names, a Chinese surname Latin mapping table, a Chinese personal name Latin mapping table, a dictionary of Sanskrit, Tibetan, and Manchu personal names, a dictionary of famous people, a lexicon of name indicator words, a place name dictionary, a place name suffix dictionary, and an organization name suffix dictionary. We also built a multi-category name knowledge base containing a multi-category name dictionary, a collocation dictionary for multi-category words, and a Mongolian personal name stem dictionary.
◇ An automatic recognition system for Mongolian personal names is designed and implemented.
Experiments show that, as one of the earlier academic results applying statistical methods to Mongolian named entity recognition in China or abroad, this study achieves a precision of 94.56%, a recall of 85.15%, and an F-value of 89.61% in the closed test, a fairly satisfactory recognition result.
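The three closed-test figures can be cross-checked against each other. The abstract does not say which F-measure is used, so the sketch below assumes the balanced F1 (the harmonic mean of precision and recall). The function names `precision_recall` and `f_measure` are hypothetical helpers written for this illustration and are not taken from the dissertation.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from counts of true positives, false positives,
    and false negatives (hypothetical helper, not from the dissertation)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# Figures reported for the closed test in the abstract.
reported_precision = 0.9456
reported_recall = 0.8515

# Assumption: the reported F-value is the balanced F1.
f1 = f_measure(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")
```

Under this assumption the computed value rounds to the reported 89.61%, so the three figures are mutually consistent with the first one being precision rather than overall accuracy.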
[Degree-granting institution]: Minzu University of China (中央民族大学)
[Degree level]: Doctoral
[Year degree granted]: 2013
[Classification number]: H212; H087



