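Assessing the Relevance of Implicit Discourse Relations in the Automated Scoring of CET-4 (College English Test Band 4) Writing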

Posted: 2018-02-13 19:26

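Keywords: CET-4 writing; implicit discourse relations; relevance; latent semantic analysis; singular value decomposition
Source: Hubei University of Technology (湖北工业大学), master's thesis, 2017
Document type: degree thesis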

[Abstract]: A sound automated essay scoring system should cover two aspects: language quality and content quality. Content scoring is the more complex of the two, because it has to analyze, within the framework of the discourse, the organic connections among chunks (words, phrases, clauses). The CET-4 writing rubric is a holistic one in which content is primary and language secondary; that is, content is the main yardstick of essay quality. This study takes the content of a text to be its implicit discourse relations, which is one of the reasons for choosing this topic. The envisioned scoring system works as follows: the computer calculates the relevance, in terms of implicit discourse relations, between an unscored essay and essays that have already been scored, and then assigns the unscored essay a score by reference to the scores of the scored essays. Judging the relevance of implicit discourse relations is therefore the core of the whole system, and it is the focus of this study.

Two main models are used to study implicit discourse relations: the traditional vector space model and latent semantic analysis (LSA). The former treats every term other than stop words as a feature and represents a text by these features. Its drawback is that it cannot handle polysemy (one word with several meanings) or synonymy (several words with one meaning). The latter also starts from words, the smallest constituents of a discourse, but it additionally takes a philosophy-of-language perspective on similarity and generalization in language acquisition, and in knowledge acquisition more broadly. This is Plato's problem: how do humans acquire so much knowledge from such limited cues? This study is based on LSA.

LSA holds that the words in a text do not exist in isolation but are closely linked through a latent semantic network. Not every word is directly related to that network, however, so the feature terms that are directly related to it have to be extracted. Feature-term extraction proceeds in two stages. The first is a coarse extraction, namely text preprocessing: case folding, stop-word removal, and stemming. The second is a refined extraction using the singular value decomposition (SVD) function of the mathematical software MATLAB. First, a term-by-document matrix is built from the coarsely extracted feature terms. Next, SVD expresses this matrix as the product of three smaller matrices. The values in the columns of the three matrices are then inspected, and the first k columns are retained, with k chosen according to the case at hand. Finally, the decomposition is reversed: the three matrices, each truncated to k columns, are multiplied together to reconstruct a new matrix. The new matrix suppresses a large amount of noise while retaining the important information in the original matrix, which amounts to feature extraction in the proper sense. This is how the computer simulates the human ability to recognize similarity and to generalize, and it is the theoretical core of the thesis.

The thesis first uses a classic small example to show the role LSA plays in assessing the relevance of implicit discourse relations. It then analyzes CET-4 essays written by non-English-major undergraduates at Hubei University of Technology and concludes that the correlation coefficients of implicit discourse relations are indeed related, to some degree, to the scores assigned by human raters.
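The two-stage extraction described in the abstract can be illustrated with a minimal sketch. The thesis uses MATLAB; the code below uses Python with NumPy instead, and everything in it is an assumption made for illustration rather than the author's implementation: the tiny stop-word list, the crude suffix-stripping function standing in for a real stemmer, the raw term counts in the matrix, the sample essays, and the choice of cosine similarity between document columns as the relevance measure.

```python
import re
import numpy as np

# Hypothetical, deliberately tiny stop-word list (a real system would use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it", "for", "on"}

def crude_stem(word):
    """Rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Stage 1: case folding, stop-word removal, stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def term_document_matrix(docs):
    """Build a term-by-document matrix of raw counts (rows: terms, columns: documents)."""
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(token_lists):
        for t in tokens:
            matrix[index[t], j] += 1.0
    return vocab, matrix

def rank_k_approximation(matrix, k):
    """Stage 2: factor with SVD, keep the first k components, multiply back."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Made-up example: three "scored" essays and one "unscored" essay (last).
essays = [
    "Students are reading books in the library",
    "The library lends books to students",
    "Pollution is damaging rivers and lakes",
    "A student borrowed a book",
]
vocab, counts = term_document_matrix(essays)
reduced = rank_k_approximation(counts, k=2)  # k=2 is an arbitrary choice here

# Compare the unscored essay (last column) with each scored essay.
for j in range(len(essays) - 1):
    print(j, round(cosine(reduced[:, -1], reduced[:, j]), 3))
```

The abstract does not say how the scores of the already-scored essays are combined once the similarities are known, so the sketch stops at the similarity values.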
[Degree-granting institution]: Hubei University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: H319.3
