Research and Implementation of Automatic Grammar Checking and Correction for English Text

Published: 2018-06-13 19:48

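Topic: automatic grammar checking and correction + corpus; Source: master's thesis, Beijing University of Posts and Telecommunications (北京邮电大学), 2016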

[Abstract]: As global integration deepens, English, as the world's common language, is receiving ever wider attention from learners. Of the four basic skills of English learning (listening, speaking, reading and writing), writing is regarded as the most practical, the one that draws on the broadest range of knowledge, and the hardest to train. For learners of English as a second language (ESL), differences in culture and ways of thinking, together with the influence of the mother tongue, make grammatical errors one of the most common and most difficult problems in writing. Automatic grammar checking and correction for English text applies techniques from natural language processing, combined with machine learning methods, so that a computer can automatically judge whether an English sentence contains grammatical errors and correct them.

This thesis proposes a corpus-based method for extracting rules automatically and, building on it, a corpus-based grammatical error checking and correction algorithm that uses a limited backoff strategy. First, a large amount of English text is collected with a web crawler. After preprocessing (text cleaning, sentence segmentation and part-of-speech tagging), the text is indexed to form a corpus that can be queried in real time. Then, using the training set, the rule extraction method above yields rules that describe erroneous grammar, and the error candidates that are detected are corrected with the limited backoff strategy. On the CoNLL 2013 evaluation data for automatic grammar checking and correction, the method achieves an overall F1 of 0.3196, exceeding the first-place score of 0.3120. For article errors its F1 is 0.3345, above the best 2013 result of 0.3340, and for noun errors its F1 is 0.4531, above the best 2013 result of 0.4435. The experimental results indicate that the proposed method is effective for checking and correcting grammatical errors.

The main contributions of this thesis are as follows:

1. A method is proposed for extracting grammar rules automatically from a training set and a corpus, and 41278 rules were extracted from the CoNLL2013 training set. Writing grammar rules by hand is time-consuming and laborious, the resulting rules may be incomplete, and they are not targeted at the errors that ESL writers actually make. Automatic rule extraction addresses these problems effectively.

2. A search method based on mixed word and part-of-speech queries is proposed, and a corpus that can be queried in real time was built. It contains 16618045 sentences drawn from the New York Times, student essays from Pigai (批改网), and the CoNLL2013 training set. The corpus supports searches by word, by phrase, by part of speech, and by mixtures of words and parts of speech. This search capability underpins both the corpus-based extraction of error rules and the subsequent automatic checking and correction.

3. A method is proposed that filters text against a knowledge base to reduce the rate at which fixed collocations are misjudged as grammatical errors, and a list of fixed collocations was built to serve grammar checking and correction. Fixed collocations that are idiomatic but do not necessarily follow the grammar are easily overlooked during automatic checking and correction, which lowers the accuracy of the system. Filtering with the fixed collocation list reduces the system's misjudgment rate.

4. A corpus-based grammatical error checking and correction algorithm with a limited backoff strategy is proposed for automatic grammar checking and correction. The algorithm ties the backoff process to the window size and so controls the backoff more finely, which noticeably improves the performance of the whole system.
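The mixed word and part-of-speech query and the window-dependent backoff described in the abstract can be illustrated with a small Python sketch. This is not the thesis's implementation, whose details the abstract does not give. The toy corpus, the function names `count` and `choose`, the convention that an upper-case slot denotes a part-of-speech tag, and the `max_window` and `min_window` parameters are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Toy corpus, hand-tagged with Penn Treebank tags (illustrative data only).
CORPUS: List[List[Token]] = [
    [("she", "PRP"), ("bought", "VBD"), ("an", "DT"), ("apple", "NN")],
    [("he", "PRP"), ("ate", "VBD"), ("an", "DT"), ("apple", "NN")],
    [("they", "PRP"), ("bought", "VBD"), ("a", "DT"), ("book", "NN")],
]


def matches(slot: str, token: Token) -> bool:
    """Assumed convention: an upper-case slot is a POS tag, anything else a word."""
    word, pos = token
    return slot == pos if slot.isupper() else slot == word.lower()


def count(pattern: List[str]) -> int:
    """Count occurrences of a mixed word/POS pattern in CORPUS."""
    n = len(pattern)
    total = 0
    for sent in CORPUS:
        for i in range(len(sent) - n + 1):
            if all(matches(s, t) for s, t in zip(pattern, sent[i:i + n])):
                total += 1
    return total


def choose(left: List[str], candidates: List[str], right: List[str],
           max_window: int = 2, min_window: int = 1) -> Optional[str]:
    """Pick the candidate with the most corpus hits in its context.

    The context window shrinks from max_window to min_window tokens per
    side; the search stops at min_window instead of backing off all the
    way to a bare unigram (the "limited" part, as assumed here).
    """
    for w in range(max_window, min_window - 1, -1):
        lctx, rctx = left[-w:], right[:w]
        scores = {c: count(lctx + [c] + rctx) for c in candidates}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            return best
    return None  # no evidence at any allowed window: leave the text unchanged


if __name__ == "__main__":
    # Mixed query: the literal word "bought" followed by any determiner and noun.
    print(count(["bought", "DT", "NN"]))
    # Article choice with backoff from a two-token to a one-token window.
    print(choose(["we", "bought"], ["a", "an"], ["apple"]))
```

In this sketch, returning `None` when no window produces a hit keeps the checker from proposing corrections without corpus evidence. A real system would query an indexed corpus rather than scan sentences linearly.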
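[Degree-granting institution]: Beijing University of Posts and Telecommunications (北京邮电大学)
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1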
