
Research on Computational Models of Child Language Acquisition

Published: 2018-12-13 03:09
[Abstract]: Research on computational models of language acquisition studies how linguistic knowledge can be acquired by computational means, and it is an indispensable part of high-quality natural language processing applications. Childhood is the critical period for acquiring linguistic knowledge: humans acquire their basic linguistic knowledge as children. Developing computational models of child language acquisition is therefore of great value to research on computational models of language knowledge acquisition in general. At the same time, developing such models, especially models that can effectively incorporate various cognitive processes, is a very effective way to study and evaluate the cognitive hypotheses made about child language acquisition, and it is valuable for revealing the mechanisms of child language development. For these reasons, researchers in computational linguistics, cognitive psychology, developmental linguistics, and other fields have carried out a wide range of work on computational models of child language acquisition.

However, existing computational models of child language acquisition still have shortcomings. For example, lexical category acquisition models lack a unified evaluation method and require the number of categories to be set in advance; syntax acquisition models are weak at describing long-distance dependencies; and findings from cognitive research on child language acquisition are not yet sufficiently incorporated into the models.

To address these shortcomings, this thesis carries out several studies on child corpus construction and on computational models of children's lexical category acquisition and syntax acquisition. The main work and results are as follows.

(1) A Chinese-character corpus of child speech and child-directed speech was built, and child language, child-directed language (the language adults use when speaking to children, commonly abbreviated CDS for Child-Directed Speech), and adult language were counted, compared, and analyzed at the character, word, and sentence levels. A child's language ability shows both in the language the child produces and in the child's ability to understand child-directed language, and both child language and child-directed language differ considerably from adult language. Building child and child-directed corpora is therefore the foundation for studying child language acquisition. Computational models of child language acquisition need to be built on such corpora, and they should also be trained and evaluated on them. As the first step of this research, the thesis built a corpus of child and child-directed speech by transcribing, annotating, and correcting the Chinese data in CHILDES, which the thesis describes as currently the largest corpus of child spoken language in the world.

(2) For computational models of children's lexical category acquisition, the thesis works on both the evaluation method and the model itself. It proposes a new measure, called Cohesivity, for evaluating lexical category acquisition. The measure jointly accounts for three evaluation criteria: informativeness, diversity, and accuracy. Experiments show that it is feasible and effective. The thesis also proposes using Dirichlet Process Mixture Models (DPMMs) and the Affinity Propagation (AP) algorithm for lexical category acquisition, which avoids the need in earlier work to predefine the number of categories. Furthermore, based on the cognitive observation that other cognitive channels can supply prior information for language acquisition, it builds a semi-supervised AP algorithm in which manually annotated seed words simulate the prior information coming from other channels. Experimental results show that this prior information is effective.

(3) The thesis proposes a staged syntax acquisition model that captures the cognitive process by which children's syntax develops from simple to complex and from concrete to abstract. Experimental results show that the model is effective. Syntax acquisition in the model proceeds in three stages. In the first stage, contiguous concrete structures are acquired; only syntactic structures made up of contiguous terminal symbols are considered. In the second stage, long-distance dependency structures are acquired; still only terminal symbols are considered, but non-contiguous structures can now be learned. In the third stage, hierarchical structures are acquired; the model learns hierarchical syntactic structures that mix terminal and nonterminal symbols, which completes the acquisition of syntactic structure.

(4) The thesis models the cognitive process in which lexical categories and syntactic structures in child language grow incrementally, stage by stage. The proposed lexical category acquisition model is trained in stages and combined with the staged syntax acquisition model above, yielding a framework for syntax acquisition based on lexical categories. The model is then applied to language generation: the generated language is compared with child language and child-directed language, and it is also evaluated by human judges. Experiments show that incorporating lexical category information effectively improves syntax acquisition performance, and that the generated language is reasonably fluent and comprehensible.
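The three-stage progression described in item (3) can be illustrated with a toy sketch. This is not the thesis's model, which the abstract does not specify in detail. The function names, the frequency threshold `min_count`, the gap limit `max_gap`, the nonterminal label `X1`, and the four-utterance English corpus are all assumptions made up for illustration. The sketch only counts frequent word pairs: adjacent pairs for stage 1, pairs separated by skipped words for stage 2, and pairs that include a merged nonterminal for stage 3.

```python
from collections import Counter


def contiguous_patterns(sentences, min_count=2):
    """Stage 1 (toy): frequent pairs of adjacent terminals (words)."""
    counts = Counter()
    for words in sentences:
        counts.update(zip(words, words[1:]))
    return {pair: c for pair, c in counts.items() if c >= min_count}


def gapped_patterns(sentences, min_count=2, max_gap=2):
    """Stage 2 (toy): frequent non-contiguous pairs, skipping 1..max_gap words."""
    counts = Counter()
    for words in sentences:
        for i, left in enumerate(words):
            for gap in range(1, max_gap + 1):
                j = i + gap + 1
                if j < len(words):
                    counts[(left, words[j])] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


def merge_most_frequent(sentences, patterns, label="X1"):
    """Stage 3 (toy): rewrite the most frequent adjacent pair as one nonterminal.

    Later pattern counts can then mix terminals with the new nonterminal,
    which gives a very small amount of hierarchy.
    """
    if not patterns:
        return [list(words) for words in sentences], None
    # Sorting first makes the choice deterministic when counts tie.
    best = max(sorted(patterns), key=patterns.get)
    merged = []
    for words in sentences:
        out, i = [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) == best:
                out.append(label)
                i += 2
            else:
                out.append(words[i])
                i += 1
        merged.append(out)
    return merged, best


# Hypothetical child-directed utterances, invented for this sketch.
corpus = [
    "want the ball".split(),
    "want the milk".split(),
    "give me the ball".split(),
    "give mommy the ball".split(),
]

stage1 = contiguous_patterns(corpus)
stage2 = gapped_patterns(corpus)
merged_corpus, merged_pair = merge_most_frequent(corpus, stage1)
stage3 = gapped_patterns(merged_corpus)

print("stage 1:", stage1)
print("stage 2:", stage2)
print("merged pair:", merged_pair)
print("stage 3:", stage3)
```

On this toy corpus the stage 3 count picks up a pattern that pairs the word "give" with the merged nonterminal, which neither of the earlier stages can express. A real model would need a principled criterion for choosing which structures to keep, which this sketch replaces with a raw frequency threshold.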
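[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Doctorate
[Year degree granted]: 2012
[Classification number]: H193.1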





