Research on Thai Dependency Parsing Based on Cross-Lingual Transfer Learning

Published: 2018-05-24 11:31

Topics: Thai syntactic parsing + dependency parsing; Source: Kunming University of Science and Technology, 2017 master's thesis

[Abstract]: With the development of computer science and artificial intelligence, natural language processing (NLP), a branch of artificial intelligence, has done much to promote the political, economic, and cultural development of countries around the world, which makes NLP research especially important. Syntactic parsing is a core topic in NLP and underlies research in machine translation, information retrieval, and text analysis. For languages such as Chinese and English, parsing with traditional dependency parsing methods is relatively mature. These methods, however, depend on large annotated corpora and elaborate hand-designed feature templates. Building both by hand is time-consuming and laborious, so the traditional methods are hard to apply to languages that lack corpus resources. This thesis therefore proposes a cross-lingual transfer learning approach to dependency parsing for low-resource languages. Thai syntactic research is severely short of corpus resources, so the thesis completes the following work on Thai dependency parsing:

(1) A neural network method for bilingual distributed word representations based on a Chinese-Thai parallel sentence-pair corpus. Thai has received relatively little research attention and has no large-scale corpus, which makes Thai NLP research harder. According to the thesis, however, Chinese and Thai both belong to the Sino-Tibetan language family and are syntactically very similar, so Thai can draw on the research done for corpus-rich Chinese. Bilingual distributed word representations can establish a link between the two languages, so the thesis proposes a bilingual word representation model trained on Chinese-Thai parallel sentence pairs. Experiments show that the word representations reach an accuracy of 82.60%.

(2) A transfer-learning method for Thai dependency parsing. Research on dependency parsing for Chinese is relatively mature. Building on the bilingual word representations, the thesis therefore uses a corpus of 40,000 Chinese-Thai parallel sentence pairs and studies Thai dependency parsing by transferring features from Chinese. The proposed neural network model for Thai dependency parsing reaches 79.28% dependency arc accuracy, 75.01% label accuracy, and 91.25% sentence root accuracy.

(3) Visualization of the Thai dependency parsing system. The system is developed in Java and outputs dependency-annotated sentences in CoNLL format. The DependencyViewer tool then renders them graphically, so that the horizontal dependency view and the tree view of a whole sentence can be inspected easily.

In summary, cross-lingual transfer learning alleviates the scarcity of corpus resources to some extent, and the thesis also takes the linguistic characteristics of Thai into account when transferring to Thai dependency parsing. The bilingual distributed word representations prepare the ground for transfer-learning parsing, which in turn is a concrete application of those representations, and the approach achieves good results.
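To make the three reported parsing scores concrete, the following is a minimal Python sketch of how such scores could be computed from CoNLL-style files. It is not the thesis's code (the thesis system is written in Java). It assumes the 10-column CoNLL-X layout, reads "dependency arc accuracy" as the share of tokens with the correct head, "label accuracy" as the share of tokens with both the correct head and the correct relation label, and "sentence root accuracy" as the share of sentences whose root token is predicted correctly. The function names, the toy Thai sentence, and its relation labels are all hypothetical.

```python
from typing import Dict, List, Tuple

# One token: (id, form, head id, dependency label). Head id 0 marks the root.
Token = Tuple[int, str, int, str]


def to_conll(sentence: List[Token]) -> str:
    """Serialize one sentence as 10 tab-separated CoNLL-X columns per token."""
    lines = []
    for tok_id, form, head, deprel in sentence:
        # ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        cols = [str(tok_id), form, "_", "_", "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n"


def read_conll(text: str) -> List[List[Token]]:
    """Parse CoNLL-X text; sentences are separated by blank lines."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def evaluate(gold: List[List[Token]], pred: List[List[Token]]) -> Dict[str, float]:
    """Return arc accuracy, labeled accuracy, and per-sentence root accuracy."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    total = arcs_ok = labeled_ok = roots_ok = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("token counts differ within a sentence")
        for g, p in zip(g_sent, p_sent):
            total += 1
            if g[2] == p[2]:          # head matches
                arcs_ok += 1
                if g[3] == p[3]:      # head and label both match
                    labeled_ok += 1
        g_roots = [t[0] for t in g_sent if t[2] == 0]
        p_roots = [t[0] for t in p_sent if t[2] == 0]
        if g_roots == p_roots:
            roots_ok += 1
    return {
        "arc": arcs_ok / total,
        "label": labeled_ok / total,
        "root": roots_ok / len(gold),
    }


# Toy example: "I eat rice" in Thai, with hypothetical relation labels.
GOLD_TEXT = to_conll([(1, "ฉัน", 2, "nsubj"), (2, "กิน", 0, "root"), (3, "ข้าว", 2, "obj")])
# A made-up prediction: wrong label on token 1, wrong head on token 3.
PRED_TEXT = to_conll([(1, "ฉัน", 2, "obj"), (2, "กิน", 0, "root"), (3, "ข้าว", 1, "obj")])

scores = evaluate(read_conll(GOLD_TEXT), read_conll(PRED_TEXT))
print(scores)
```

Because the file format is plain tab-separated text, the same files that a viewer such as DependencyViewer displays could in principle be scored this way, although the exact column conventions that tool expects are not stated in the abstract.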
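[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1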

[References]