Research and Implementation of Chinese Person Name Recognition Based on Statistics and Rules
Published: 2018-08-17 10:38
[Abstract]: Chinese word segmentation is a foundational topic in Chinese information processing, with wide application in search engines, machine translation, information extraction, text clustering, and related fields. At present, the main factors limiting segmentation quality are ambiguous segmentation and the recognition of out-of-vocabulary (unregistered) words. Among out-of-vocabulary words, person names are the most numerous and the hardest to recognize, so segmentation systems usually include a dedicated module for them. Better name recognition not only improves segmentation accuracy but also greatly helps information extraction and lexical analysis.
This thesis studies the automatic recognition of person names in modern Chinese text. Starting from statistics gathered over a large-scale name sample database and a large-scale corpus, it analyzes the characters used in names and the words that occur at name boundaries, and summarizes the patterns in which both appear. Names are then recognized with a statistical model based on relative credibility, combined with a set of rules designed around the characteristics of the system itself. The main work falls into three parts. First, the resources used for name recognition are analyzed: statistics are collected over a large name database (4.8 million names) and a corpus (cumulative word frequency of 3 billion), the characteristics and patterns of name characters are summarized, and name boundary information is analyzed in detail. Boundary words are graded according to their part of speech and meaning, and these grades serve as external attributes of a name that support recognition. The encyclopedia corpus used in this thesis is also compared with traditional corpora, and its advantages are pointed out. Second, on the statistical side, the relative-credibility model is applied to the large-scale corpus, two special forms of names are modeled and counted separately, and statistical tables are built for each category of name character. Third, on the rule side, a set of rules is designed to extract candidate names and to correct the recognition results. Finally, the thresholds and parameters used by the system are obtained from the statistics, the methods used during the research are compared experimentally, and the effectiveness of the statistical model and the rules is verified.
The system was tested on the People's Daily corpus from January 1998. The experimental results show that it achieves relatively high precision and recall, that name recognition performs well, and that it improves the accuracy of the overall word segmentation system.
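The combination of per-character statistics, boundary words, and a threshold described in the abstract can be illustrated with a small sketch. The abstract does not define "relative credibility", so the code below assumes one plausible reading: the fraction of a character's corpus occurrences that fall inside a person name. The count tables, boundary-word weights, threshold, and all function names are hypothetical toy values chosen for illustration, not the thesis's actual data or method.

```python
from math import log

# Hypothetical toy tables: char -> (occurrences inside a name, total occurrences).
SURNAME_COUNTS = {"王": (900, 1000), "李": (850, 1000), "张": (800, 1000)}
GIVEN_COUNTS = {"伟": (600, 1000), "芳": (700, 1000), "明": (500, 1000), "的": (1, 1000)}

# Hypothetical boundary words with a multiplicative weight per grade.
BOUNDARY_WORDS = {"同志": 2.0, "教授": 2.0, "说": 1.5}

# Hypothetical decision threshold on the log score.
THRESHOLD = log(0.2)


def credibility(char, table, floor=1e-3):
    """Assumed 'relative credibility': share of a character's uses that are in names."""
    in_name, total = table.get(char, (0, 0))
    if total == 0:
        return floor  # unseen character: fall back to a small floor value
    return max(in_name / total, floor)


def score_candidate(name, right_context=""):
    """Log score of a candidate: surname character, given-name characters, boundary word."""
    if len(name) < 2:
        return float("-inf")  # need at least a surname and one given-name character
    score = log(credibility(name[0], SURNAME_COUNTS))
    for ch in name[1:]:
        score += log(credibility(ch, GIVEN_COUNTS))
    # A boundary word directly after the candidate raises the score.
    for word, weight in BOUNDARY_WORDS.items():
        if right_context.startswith(word):
            score += log(weight)
            break
    return score


def is_name(name, right_context="", threshold=THRESHOLD):
    """Accept the candidate when its score reaches the threshold."""
    return score_candidate(name, right_context) >= threshold
```

With these toy numbers, `is_name("王伟")` accepts the candidate, while `is_name("王的")` rejects it because 的 almost never occurs inside a name. A real system would estimate the tables from a name database and corpus and tune the threshold experimentally, as the abstract describes.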
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1
Article ID: 2187344
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2187344.html