Research on Chinese-Vietnamese Statistical Machine Translation Methods Incorporating Linguistic Differences

Published: 2018-04-24 21:02

Topics: statistical machine translation + Chinese-Vietnamese; Source: Kunming University of Science and Technology, 2017 master's thesis

[Abstract]: Vietnam is an important Southeast Asian country that borders China and has long maintained frequent political and economic exchanges with it. Machine translation is one of the major branches of natural language processing research, and Chinese-Vietnamese statistical machine translation provides important support for Chinese-Vietnamese bilingual understanding, information retrieval, cultural exchange, and economic trade. Chinese-to-Vietnamese translation models are still at an early stage; existing work concentrates mainly on building bilingual parallel corpora, on word alignment methods for Chinese-Vietnamese, and on Vietnamese dependency syntax trees. Because only a small amount of Chinese-Vietnamese parallel text is available on the Internet, a translation model trained on such sparse data can hardly cover comprehensive linguistic knowledge. Moreover, without guidance from linguistic differences, the translation model and the decoding algorithm depend entirely on corpus size, which increases the probability of introducing errors. Incorporating linguistic differences into Chinese-Vietnamese translation models is therefore a difficult open problem. Vietnamese and Chinese share some linguistic features and differ in others. Both follow subject-verb-object order. They differ in that Vietnamese modifiers (attributives, adverbials, and so on) are postposed relative to their Chinese positions: a Vietnamese adjective follows the noun it modifies, and an adverb follows the adjective or verb it modifies. Based on this analysis, this thesis models and studies the incorporation of linguistic differences in two frameworks, the hierarchical phrase-based model and the syntactic tree-to-tree model:

(1) A hierarchical phrase-based translation model that incorporates linguistic differences into the lexicalized model. First, the Chinese Academy of Sciences Chinese word segmentation and part-of-speech tagging tool and a Vietnamese word segmentation tool are used to segment and tag the Chinese-Vietnamese parallel sentence pairs, and GIZA++ is used to obtain the bilingual word alignments. The word alignments are then used to extract initial phrase pairs, which are generalized into rules with non-terminals and used to train the hierarchical phrase-based translation model. Next, the differences between Chinese and Vietnamese are analyzed, the linguistic features are formally defined, and these features are incorporated into the lexicalized reordering model. Decoding uses the CKY algorithm. The experiments compare the hierarchical phrase-based model with linguistic differences incorporated into the lexicalized model against the conventional hierarchical phrase-based model, under language models of different n-gram orders. The results show that incorporating linguistic differences into the lexicalized model improves translation quality.

(2) A Chinese-Vietnamese statistical machine translation method based on a syntactic tree-to-tree translation model that incorporates linguistic features. First, the sentences are parsed to produce bilingual syntax trees. Next, word alignments are obtained with GIZA++, rule pairs are extracted from the corresponding tree pairs, and a rule base is built. The rich phrase pairs of the phrase-based translation model are used to generalize the source-language and target-language parse trees and thereby enlarge the rule base. Linguistic difference features are then used to preprocess the rules and to tune the translation model. Decoding uses a tree-parsing algorithm, and the generalization of the target language guides the generation of candidate translations. The experiments compare the BLEU scores of three models: the hierarchical phrase-based model with linguistic differences incorporated into the lexicalized model, the syntactic tree-to-tree model, and the tree-to-tree model with linguistic features. The results show that the proposed method enlarges the rule base while also improving translation accuracy.

(3) A prototype Chinese-Vietnamese syntactic tree-to-tree translation system that incorporates linguistic differences. Built on a syntactic tree-to-tree translation system, it incorporates the linguistic differences between Chinese and Vietnamese as features into the optimization of the rule base and into the modeling stage of the translation model. The system is built with several open-source tools and frameworks, including the NiuTrans translation framework, the Chinese Academy of Sciences segmentation and tagging tool, and GIZA++. The front end is implemented with Java Servlet technology and passes the sentences to be translated to the translation model for decoding, which completes the prototype system.
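The phrase-pair extraction and lexicalized reordering steps described in (1) can be illustrated with a small example. The abstract does not give the thesis's formal definition of the linguistic features, so the following is only a minimal sketch under stated assumptions: the sentence pair and its alignment are made up for illustration (they are not from the thesis corpus), the part-of-speech tags "a" (adjective) and "d" (adverb) are assumed to follow a PKU-style tag set, the word-based monotone/swap/discontinuous orientation is one common way to define a lexicalized reordering model, and `is_postposed_modifier` is a hypothetical feature name rather than anything defined in the thesis.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)
Span = Tuple[int, int]  # inclusive (start, end) word indices


def extract_phrase_pairs(src_len: int, links: Set[Link],
                         max_len: int = 4) -> List[Tuple[Span, Span]]:
    """Collect (source span, target span) pairs consistent with the alignment."""
    pairs = []
    for s_start in range(src_len):
        for s_end in range(s_start, min(s_start + max_len, src_len)):
            targets = [t for s, t in links if s_start <= s <= s_end]
            if not targets:
                continue  # a phrase pair needs at least one alignment link
            t_start, t_end = min(targets), max(targets)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: no target word in the span may link outside the source span.
            if all(s_start <= s <= s_end for s, t in links if t_start <= t <= t_end):
                pairs.append(((s_start, s_end), (t_start, t_end)))
    return pairs


def orientation(src: Span, tgt: Span, links: Set[Link]) -> str:
    """Word-based orientation of a phrase pair relative to the preceding target word."""
    prev_t = tgt[0] - 1
    # Assumption of this sketch: a pair that starts both sentences counts as monotone.
    if (src[0] - 1, prev_t) in links or (src[0] == 0 and tgt[0] == 0):
        return "monotone"
    if (src[1] + 1, prev_t) in links:
        return "swap"
    return "discontinuous"


def is_postposed_modifier(pos_tag: str, orient: str) -> bool:
    """Hypothetical feature: a Chinese adjective or adverb translated in swap order."""
    return pos_tag in {"a", "d"} and orient == "swap"


# "I buy a new book": Chinese 新 书 (new book) vs. Vietnamese sách mới (book new).
src_words = ["我", "买", "新", "书"]
src_tags = ["r", "v", "a", "n"]  # assumed PKU-style tags
tgt_words = ["tôi", "mua", "sách", "mới"]
links = {(0, 0), (1, 1), (2, 3), (3, 2)}

for src, tgt in extract_phrase_pairs(len(src_words), links):
    orient = orientation(src, tgt, links)
    # The feature is only checked for single-word source phrases in this sketch.
    fired = src[0] == src[1] and is_postposed_modifier(src_tags[src[0]], orient)
    print(" ".join(src_words[src[0]:src[1] + 1]), "|||",
          " ".join(tgt_words[tgt[0]:tgt[1] + 1]), "|||", orient, "|||", fired)
```

In this toy example the crossing links between 新/mới and 书/sách make the pair 新 ||| mới a swap, which is where the hypothetical modifier feature fires, while the larger pair 新 书 ||| sách mới stays monotone because the reordering is contained inside it. A real system would read the alignments from GIZA++ output and learn feature weights during tuning.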

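[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.2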

[References]