Research on Traditional Chinese Spelling Error Detection

Published: 2018-03-27 15:57

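Topic: Chinese language processing　Focus: spelling error detection　Source: Nanjing University of Posts and Telecommunications, 2016 master's thesis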

[Abstract]: Traditional Chinese spelling error detection is the use of computers to automatically detect whether Chinese characters are misused in Traditional Chinese text. It is an important research topic in Chinese information processing and an important component of many natural language processing systems, such as search engines and word processing software. Compared with commonly used Western languages such as English, Chinese has more complex linguistic properties: there are no explicit delimiters between words, and both word collocations and grammatical constructions are complex and varied. Spelling error detection is therefore harder for Traditional Chinese than for English. Research on Simplified Chinese spelling error detection began earlier than research on Traditional Chinese and produced three main families of methods: rule-based, statistical, and feature-and-learning-based. These methods, however, are built on Simplified Chinese corpora and cannot handle the detection of multiple types of spelling errors, so they can serve only as reference methods. In recent years, with the launch of Traditional Chinese spelling check evaluations, the problem has gradually become an active research area in Chinese information processing. This thesis takes the detection of spelling errors in Traditional Chinese text as its research goal and proposes three effective detection methods:
(1) A detection method based on a statistical dictionary of segmented character strings. It uses the frequency with which character strings occur in a corpus as the evidence for error detection, builds a statistical dictionary from the strings and their frequencies, and defines a detection algorithm that judges text against statistical rules.
(2) A detection method based on a graph model and a part-of-speech bi-gram model. It starts from Chinese word segmentation, represents the segmentation result and the results of replacing suspicious words as a graph, and uses the part-of-speech bi-gram model to determine the final erroneous characters.
(3) A method based on a contextual part-of-speech statistical model, aimed at errors in the common particles "的", "地", and "得". It builds the model from a training corpus and uses it to judge whether a particle is used correctly.
All three methods were validated experimentally on Traditional Chinese spelling evaluation datasets and compared with existing detection methods. The experimental results indicate that the proposed methods achieve relatively good results and further advance Traditional Chinese spelling error detection technology.
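The abstract gives no algorithmic detail for method (1), so the following is only a minimal sketch of the general idea of frequency-based detection with a string statistical dictionary, not the thesis's algorithm. The function names (`build_stat_dict`, `detect_suspects`), the substring lengths, the `min_freq` threshold, and the coverage rule are all illustrative assumptions, and the three-sentence corpus is toy data.

```python
from collections import Counter


def build_stat_dict(corpus, max_n=3):
    """Count every character substring of length 2..max_n in the corpus.

    Assumption: the 'statistical dictionary' is modelled here as a plain
    substring -> frequency table; the thesis may build it differently.
    """
    counts = Counter()
    for sentence in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def detect_suspects(sentence, stat_dict, min_freq=1):
    """Return indices of characters not covered by any known substring.

    Assumed rule: a character is suspicious when no substring containing it
    occurs in the dictionary at least min_freq times.
    """
    covered = [False] * len(sentence)
    max_n = max((len(key) for key in stat_dict), default=2)
    for n in range(2, max_n + 1):
        for i in range(len(sentence) - n + 1):
            if stat_dict.get(sentence[i:i + n], 0) >= min_freq:
                for j in range(i, i + n):
                    covered[j] = True
    return [i for i, ok in enumerate(covered) if not ok]


if __name__ == "__main__":
    # Toy Traditional Chinese corpus (illustrative data only).
    corpus = ["我們今天去學校", "他們明天去學校", "今天天氣很好"]
    stat_dict = build_stat_dict(corpus)
    # "較" is a misuse of "校" in the last position.
    print(detect_suspects("我們今天去學較", stat_dict))  # prints [6]
```

On a real corpus the threshold would have to be tuned, and a pure coverage rule like this one would flag rare but correct strings, which is presumably one reason the thesis adds rule-based judgement on top of the raw frequencies.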
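[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1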
