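Research on Thai Dependency Parsing Based on Cross-Language Transfer Learning
Topics: Thai syntactic parsing + dependency parsing; Source: Kunming University of Science and Technology, master's thesis, 2017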
[Abstract]: With the development of computer science and artificial intelligence, natural language processing (NLP), a branch of artificial intelligence, has done much to promote political, economic, and cultural development in many countries, which makes NLP research especially important. Syntactic analysis is a core topic in NLP and underpins research in machine translation, information retrieval, and text analysis. For languages such as Chinese and English, parsing based on traditional dependency parsing methods is relatively mature. These methods, however, depend on large annotated corpora and on complex hand-crafted feature templates. Annotating corpora and designing feature templates by hand is time-consuming and laborious, so traditional dependency parsing is hard to apply to languages that lack corpus resources. This thesis therefore proposes a cross-language transfer learning approach to dependency parsing for low-resource languages. Thai syntactic research is severely short of corpus resources, so the thesis completes the following work on Thai dependency parsing:

(1) A neural-network method for bilingual distributed word representation based on a Chinese-Thai parallel sentence-pair corpus. Research on Thai is relatively sparse and no large-scale corpus exists, which makes Thai NLP research harder. However, Chinese and Thai, which the thesis treats as members of the Sino-Tibetan family, are syntactically very similar, so Thai can draw on the corpus-rich research on Chinese. A bilingual distributed word representation can link the two languages, so the thesis proposes a bilingual distributed word representation model trained on Chinese-Thai parallel sentence pairs. Experiments show that the representation reaches an accuracy of 82.60%.

(2) A transfer-learning method for Thai dependency parsing. Dependency parsing for Chinese is comparatively mature. Building on the bilingual word representation, the thesis uses a corpus of 40,000 Chinese-Thai parallel sentence pairs and studies Thai dependency parsing by transferring features from Chinese. The proposed neural-network Thai dependency parsing model achieves a dependency arc accuracy of 79.28%, a label accuracy of 75.01%, and a sentence root accuracy of 91.25%.

(3) Visualization for the Thai dependency parsing system. The system is developed in Java and outputs dependency-annotated sentences in CoNLL format. The DependencyViewer tool provides the graphical display, which makes it easy to inspect both the horizontal dependency view and the tree view of a whole sentence.

In summary, cross-language transfer learning alleviates the corpus scarcity problem to a certain extent, and the thesis also takes the linguistic characteristics of Thai into account when transferring knowledge for Thai dependency parsing. The bilingual distributed word representation prepares the ground for transfer-learning parsing, and transfer-learning parsing is in turn a concrete application of that representation. The approach achieves good results.
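The three parsing scores reported in the abstract (dependency arc accuracy, label accuracy, and sentence root accuracy) can be illustrated with a minimal sketch. This is not the thesis's evaluation code. It assumes the 10-column CoNLL-X layout, with HEAD in column 7 and DEPREL in column 8, and it assumes that label accuracy counts a token only when both its head and its label are correct and that root accuracy is measured per sentence. The function names `read_conll` and `evaluate` and the toy rows are hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, str]  # (head index, dependency label); head 0 marks the root


def read_conll(text: str, head_col: int = 6, label_col: int = 7) -> List[List[Token]]:
    """Parse tab-separated CoNLL-style text into sentences of (head, label) pairs."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # skip comment lines
            continue
        cols = line.split("\t")
        current.append((int(cols[head_col]), cols[label_col]))
    if current:
        sentences.append(current)
    return sentences


def evaluate(gold: List[List[Token]], pred: List[List[Token]]) -> Dict[str, float]:
    """Return arc, label, and root accuracy under the assumptions stated above."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files differ in sentence count")
    tokens = arcs = labels = roots = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("sentence length mismatch")
        for (g_head, g_label), (p_head, p_label) in zip(g_sent, p_sent):
            tokens += 1
            if g_head == p_head:
                arcs += 1
                if g_label == p_label:
                    labels += 1
        # Positions of root tokens must match exactly for the sentence to count.
        g_root = [i for i, (h, _) in enumerate(g_sent) if h == 0]
        p_root = [i for i, (h, _) in enumerate(p_sent) if h == 0]
        roots += int(g_root == p_root)
    return {
        "arc_accuracy": arcs / tokens,
        "label_accuracy": labels / tokens,
        "root_accuracy": roots / len(gold),
    }


def row(idx: int, head: int, label: str) -> str:
    """Build one toy 10-column CoNLL-X row; unused fields are '_'."""
    return "\t".join([str(idx), f"w{idx}", "_", "_", "_", "_", str(head), label, "_", "_"])


# Toy data, invented for illustration only.
GOLD = "\n".join([row(1, 2, "nsubj"), row(2, 0, "root"), row(3, 2, "obj")])
PRED = "\n".join([row(1, 2, "obj"), row(2, 0, "root"), row(3, 1, "obj")])

scores = evaluate(read_conll(GOLD), read_conll(PRED))
print(scores)
```

In the toy prediction, the first token has the right head but the wrong label and the third token has the wrong head, so the arc and label scores differ while the root is still found.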
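[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1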