Research on the Improvement and Optimization of Chinese Automatic Word Segmentation Technology
Published: 2018-11-10 09:57
[Abstract]: Chinese automatic word segmentation is an important foundational topic in Chinese information processing, and it greatly advances research in related fields such as information extraction, full-text retrieval, data mining, machine translation, and question answering systems. This thesis studies the main technologies involved in Chinese automatic word segmentation fairly comprehensively and carefully, including the dictionary structures and the segmentation algorithms, and examines the difficult problems of Chinese word segmentation in some depth. Finally, drawing on currently popular search engine technology, it describes how Chinese automatic word segmentation is applied in that field.
The main contributions of this thesis are as follows.
First, the thesis studies the dictionary structures used in Chinese automatic word segmentation broadly and in depth. Building on three classical dictionary structures (character-by-character binary search, word-by-word binary search, and the Trie index tree) and drawing on many improved dictionary mechanisms, it proposes a segmentation dictionary mechanism based on a multi-hash balanced binary search tree.
Second, the thesis focuses on named entity recognition. For Chinese personal name recognition, it designs a new staged recognition method based on existing research results and gives the concrete implementation process. For Chinese organization name recognition, it builds on the CRF statistical model, integrates linguistic rules and knowledge, and designs and implements a Chinese medical institution name recognition system based on CRF and rules. Experimental results show that precision and recall in the closed test reach 91.68% and 95.21% respectively, offering a practical new approach to recognizing domain-specific organization names.
Finally, in view of the pressing demand for retrieval over massive amounts of information, the thesis introduces the application of Chinese automatic word segmentation in search engines in some detail. This both promotes Chinese automatic word segmentation technology and points to a direction for the future optimization and development of search engines.
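As a rough illustration of how a Trie-style segmentation dictionary supports a segmentation algorithm, the sketch below pairs a minimal character trie with forward maximum matching. It is not the multi-hash balanced binary search tree mechanism proposed in the thesis, whose details the abstract does not give. The class and function names (`TrieDictionary`, `forward_max_match`) and the toy word list are assumptions made for this example only.

```python
from typing import Dict, List


class TrieNode:
    """One node of the character trie; is_word marks the end of a dictionary word."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


class TrieDictionary:
    """Hypothetical minimal trie dictionary (illustrative, not the thesis's structure)."""

    def __init__(self, words: List[str]) -> None:
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text: str, start: int) -> int:
        """Return the length of the longest dictionary word at text[start:], or 0."""
        node = self.root
        best = 0
        for offset in range(start, len(text)):
            node = node.children.get(text[offset])
            if node is None:
                break
            if node.is_word:
                best = offset - start + 1
        return best


def forward_max_match(text: str, dictionary: TrieDictionary) -> List[str]:
    """Greedy left-to-right segmentation; unknown characters become single tokens."""
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        length = dictionary.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


if __name__ == "__main__":
    # Toy dictionary, assumed for illustration only.
    toy = TrieDictionary(["中文", "自动", "分词", "技术", "自动分词"])
    print(forward_max_match("中文自动分词技术", toy))
```

With this toy dictionary the greedy matcher prefers the longer entry "自动分词" over "自动" followed by "分词". That preference is exactly where segmentation ambiguity arises, and it is one reason dictionary lookup speed and ambiguity handling are treated as separate problems.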
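[Degree-granting institution]: 江苏科技大学 (Jiangsu University of Science and Technology)
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1
Document ID: 2322110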
Link to this article: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2322110.html