A Morpheme-Pivoted Method for Chinese-Mongolian Statistical Machine Translation

Published: 2018-05-02 07:59

Topics: intermediate language + morpheme; Source: Journal of Chinese Information Processing (《中文信息学报》), 2017, Issue 04

[Abstract]: The morphological differences between Chinese and Mongolian, together with the small size of the available parallel corpus, limit the performance of Chinese-Mongolian statistical machine translation (SMT). This paper introduces Mongolian morphological information into Chinese-Mongolian SMT. Mongolian text is segmented into morphemes, and mappings are built between Chinese words and Mongolian morphemes and between Mongolian morphemes and Mongolian words, which compensates for the asymmetry between the morphological structures of the two languages. With morphemes serving as the intermediate language, a Chinese-to-Mongolian-morpheme SMT system and a Mongolian-morpheme-to-Mongolian SMT system are trained, and a new phrase translation table and reordering model are constructed from them. These are integrated into Chinese-Mongolian SMT through multi-path decoding and multiple features. Experiments show that adding the morpheme-pivoted phrase translation table and reordering model to an existing SMT method yields a marked BLEU improvement over the baseline system and mitigates, to some extent, the effects of data sparsity and morphological differences on Chinese-Mongolian SMT. The method is general: by combining information at the morpheme level and the phrase level, it makes the two languages symmetric in morphological structure, so it applies not only to Chinese-Mongolian SMT but also to other language pairs that are morphologically asymmetric and low-resource.
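The abstract does not spell out how the two phrase tables are merged through the morpheme side. As a minimal sketch, assuming the standard pivot "triangulation" rule p(t|s) = Σ_m p(t|m)·p(m|s), the combination might look like the following. The function name `triangulate`, the table layout, and all tokens and probabilities are hypothetical placeholders, not taken from the paper, and the reordering model, multi-path decoding, and feature weighting are not shown.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two phrase tables through a shared pivot side.

    src_to_pivot: {source_phrase: {pivot_phrase: p(pivot | source)}}
    pivot_to_tgt: {pivot_phrase: {target_phrase: p(target | pivot)}}
    Returns {source_phrase: {target_phrase: p(target | source)}},
    marginalising over every pivot phrase the two tables share.
    """
    table = {}
    for src, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for pivot, p_pivot in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                scores[tgt] += p_pivot * p_tgt
        if scores:
            table[src] = dict(scores)
    return table


# Toy tables with placeholder tokens (assumed for illustration only).
zh_to_morph = {"zh_a": {"stem +sfx1": 0.5, "stem +sfx2": 0.5}}
morph_to_mn = {
    "stem +sfx1": {"word1": 1.0},
    "stem +sfx2": {"word1": 0.5, "word2": 0.5},
}

combined = triangulate(zh_to_morph, morph_to_mn)
print(combined)
```

In this toy case two morpheme sequences both lead to `word1`, so its probability mass is summed across the two paths. That pooling is the sense in which a morpheme pivot can reduce sparsity among inflected surface forms.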
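[Affiliations]: Department of Automation, University of Science and Technology of China; Hefei Institute of Intelligent Machines, Chinese Academy of Sciences
[Funding]: National Natural Science Foundation of China (61502445, 61572462); Chinese Academy of Sciences Informatization Special Project (XXH12504-1-10)
[Classification]: H085.3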
