
Research on the Recognition of Low-Frequency Two-Character Out-of-Vocabulary Words

Published: 2018-02-11 13:03

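  Keywords: low frequency; two-character words; out-of-vocabulary words; morpheme features; web retrieval. Source: Nanjing Normal University, 2012 master's thesis. Document type: degree thesis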


[Abstract]: Out-of-vocabulary (OOV) words are the main factor limiting the accuracy of automatic Chinese word segmentation. Low-frequency words are the hardest OOV words to recognize, and two-character low-frequency OOV words make up an important share of them. This thesis therefore focuses on how to recognize two-character low-frequency OOV words efficiently, combining several statistical and rule-based methods, and achieves some success.

To improve recognition efficiency and to allow meaningful statistical analysis of the experimental results, we first preprocess the text in three steps: (1) segment the text and extract the segmentation fragments; (2) recognize named entities, an important class of OOV words; (3) recognize some multi-character OOV words. We then identify low-frequency two-character OOV words among the remaining fragments using a combination of statistical and rule-based measures: mutual information, word and non-word probability, adjacent-character entropy, and morpheme-feature combination. Although the experimental results are only moderate, the approach still has practical value for assisted recognition and extraction of new words, and it can greatly reduce the burden of manual identification.

During recognition we found that the fuzziness of the definition of "word" and inconsistent segmentation in the corpus are important reasons why two-character OOV words are hard to recognize correctly. We therefore studied this issue in depth and proposed a new, reasonable definition of two-character words. We then annotated a small test corpus ourselves; with the same recognition methods, both precision and recall improved substantially. Finally, we proposed and implemented a web-based discrimination method that quantifies the property of being "tightly bound and stable in use". This method performed well in experiments on judging two-character low-frequency OOV words, with the highest F-score reaching 86%. This suggests that web resources may be a breakthrough point for improving automatic word segmentation, and in particular the automatic recognition of OOV words.
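Two of the statistical measures named in the abstract, mutual information and adjacent-character entropy, can be illustrated with a minimal Python sketch. The abstract does not give the formulas used in the thesis, so this version assumes pointwise mutual information over character frequencies and Shannon entropy over the characters next to each occurrence. The function names `entropy` and `score_candidate` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (in bits) of the distribution described by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_candidate(text, candidate):
    """Score a two-character string by PMI and adjacent-character entropy.

    Assumed definitions (not from the thesis):
      PMI = log2( P(xy) / (P(x) * P(y)) )
      left/right entropy = entropy of the characters seen immediately
      before/after each occurrence of the candidate in `text`.
    Returns None when the candidate never occurs.
    """
    if len(candidate) != 2:
        raise ValueError("candidate must be exactly two characters")

    char_counts = Counter(text)
    bigram_counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    freq = bigram_counts[candidate]
    if freq == 0:
        return None

    n_chars = len(text)
    n_bigrams = len(text) - 1
    p_xy = freq / n_bigrams
    p_x = char_counts[candidate[0]] / n_chars
    p_y = char_counts[candidate[1]] / n_chars
    pmi = math.log2(p_xy / (p_x * p_y))

    # Collect the neighboring characters of every occurrence.
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        if start + 2 < len(text):
            right[text[start + 2]] += 1
        start = text.find(candidate, start + 1)

    return {
        "pmi": pmi,
        "left_entropy": entropy(left),
        "right_entropy": entropy(right),
    }
```

Under these assumptions, a candidate whose two characters co-occur far more often than chance (high PMI) and which appears in varied contexts on both sides (high left and right entropy) is a more plausible word. For low-frequency candidates such counts are sparse and unreliable, which is consistent with the abstract's turn to web resources.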
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: H08




Article ID: 1503126


Article link: https://www.wllwen.com/wenyilunwen/hanyulw/1503126.html

