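Research and Implementation of Automatic Grammar Checking and Correction for English Essays
Topic: automatic grammar checking and correction + corpus; Source: Beijing University of Posts and Telecommunications, master's thesis, 2016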
[Abstract]: As globalization deepens, English, as the world's common language, is receiving ever wider attention from learners. Of the four basic skills of English learning (listening, speaking, reading and writing), writing is considered the most practical, the one that draws on the broadest range of knowledge, and the hardest to train. For learners of English as a second language (ESL), differences in culture and ways of thinking, together with the influence of the mother tongue, make grammatical errors one of the most common and most difficult problems in writing. Automatic grammar checking and correction for English essays uses techniques from natural language processing combined with machine learning methods so that a computer can automatically judge whether an English sentence contains grammatical errors and correct them.
This thesis proposes a corpus-based method for automatic rule extraction and, building on it, a corpus-based algorithm with a limited back-off strategy for checking and correcting grammatical errors in English essays. First, a large amount of English text is collected with a web crawler. After preprocessing (text cleaning, sentence splitting and part-of-speech tagging), the text is indexed to build a corpus that can be queried in real time. Then, using the training set and the rule extraction method above, rules describing grammatical errors are obtained, and the detected error candidates are corrected with the limited back-off strategy.
On the 2013 CoNLL grammatical error checking and correction evaluation data, the method achieves an overall F1 of 0.3196, exceeding the first-place score of 0.3120. For article errors the F1 is 0.3345, above the best 2013 result of 0.3340, and for noun errors the F1 is 0.4531, above the best 2013 result of 0.4435. The experimental results show that the proposed method is effective for checking and correcting grammatical errors.
The main contributions of this thesis are as follows:
1. A method is proposed for automatically extracting grammar rules from a training set and a corpus, and 41278 rules are extracted from the CoNLL2013 training set. Hand-written grammar rules are time-consuming and laborious to produce, may be incomplete, and are not targeted at the errors ESL writers actually make; automatic rule extraction effectively addresses these problems.
2. A search method based on mixed word and part-of-speech queries is proposed, and a corpus supporting real-time queries is built, containing 16618045 sentences from the New York Times, student essays from Pigai (批改网) and the CoNLL2013 training set. The corpus supports searching by word, phrase, part of speech, and mixtures of words and parts of speech, which underpins both the extraction of error rules from the corpus and the subsequent automatic checking and correction.
3. A method of filtering text with a knowledge base is proposed to reduce the rate at which fixed collocations are misjudged as grammatical errors, and a list of fixed collocations is built to serve the error checking and correction process. During automatic checking and correction it is very easy to overlook fixed collocations that are idiomatic but not necessarily grammatical, which lowers the accuracy of the system; filtering against the fixed collocation list reduces this misjudgment rate.
4. A corpus-based grammatical error checking and correction algorithm with a limited back-off strategy is proposed. The algorithm ties the back-off process to the window size and thereby controls the back-off more finely, which noticeably improves the performance of the whole system.
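The abstract does not give the details of the mixed word and part-of-speech query or of the back-off procedure, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes a tiny in-memory tagged corpus, a `/TAG` notation for part-of-speech slots, and a simple rule that shrinks the context window whenever no candidate has corpus evidence; the names `CORPUS`, `build_index` and `correct_slot`, and the parameters `max_window`, `min_window` and `min_count`, are all hypothetical.

```python
from collections import Counter
from itertools import product

# Toy tagged corpus of (word, POS) pairs; a real system would index millions of sentences.
CORPUS = [
    [("she", "PRP"), ("bought", "VBD"), ("a", "DT"), ("book", "NN"), ("yesterday", "NN")],
    [("he", "PRP"), ("bought", "VBD"), ("a", "DT"), ("car", "NN")],
    [("they", "PRP"), ("read", "VBD"), ("the", "DT"), ("book", "NN")],
    [("i", "PRP"), ("want", "VBP"), ("a", "DT"), ("book", "NN")],
]


def build_index(corpus, max_n=3):
    """Count every n-gram (n <= max_n) under every word/POS mixing.

    Each slot of a key is either the word itself or "/" + its POS tag,
    so ("bought", "/DT", "/NN") matches "bought a book" and "bought a car".
    """
    index = Counter()
    for sent in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                window = sent[i:i + n]
                for choice in product((0, 1), repeat=n):
                    key = tuple(
                        word if c == 0 else "/" + tag
                        for (word, tag), c in zip(window, choice)
                    )
                    index[key] += 1
    return index


def correct_slot(index, tokens, pos, candidates,
                 max_window=3, min_window=2, min_count=1):
    """Choose a replacement for tokens[pos], backing off from wide to narrow windows.

    Returns (chosen_token, window_size_used); window size 0 means no window
    gave enough evidence and the original token was kept.
    """
    for n in range(max_window, min_window - 1, -1):
        scores = {cand: 0 for cand in candidates}
        for cand in candidates:
            trial = tokens[:pos] + [cand] + tokens[pos + 1:]
            # Sum counts over every n-token window that covers position pos.
            first = max(0, pos - n + 1)
            last = min(pos, len(trial) - n)
            for start in range(first, last + 1):
                scores[cand] += index[tuple(trial[start:start + n])]
        best = max(scores, key=scores.get)  # ties go to the earlier candidate
        if scores[best] >= min_count:
            return best, n
        # Otherwise back off: retry with a narrower window.
    return tokens[pos], 0


index = build_index(CORPUS)
print(index[("bought", "/DT", "/NN")])
print(correct_slot(index, ["she", "bought", "an", "book"], 2, ["an", "a", "the"]))
print(correct_slot(index, ["we", "sold", "an", "book"], 2, ["an", "a", "the"]))
```

In this sketch the first correction is decided at the widest window because the trigram context is attested in the toy corpus, while the second has no trigram evidence and is decided only after backing off to bigrams. Bounding the loop by `min_window` is what keeps the back-off limited in this illustration; how the thesis actually couples back-off to window size is not stated in the abstract.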
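[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1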
Article ID: 2015206
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2015206.html