Research on Methods for Handling Out-of-Vocabulary Words in Neural Machine Translation

Published: 2019-05-24 04:13

[Abstract]: Neural machine translation (NMT) is a new machine translation model based on the encoder-decoder network framework, and it has far outperformed traditional methods across a wide range of translation tasks. Because of limits on GPU memory and computation time, NMT can only maintain a relatively small vocabulary containing the most frequent words; out-of-vocabulary (OOV) words are usually represented by the single symbol unk. An unk in the source sentence increases translation ambiguity, and NMT itself cannot handle an unk in its output, so it must rely on an additional post-processing method. To address the problems caused by OOV words, this thesis divides the NMT translation process into three stages, "preprocessing", "in-model", and "post-processing", and studies OOV handling methods at each stage. First, in the "post-processing" stage, to address the shortcomings of existing OOV post-processing methods for NMT, this thesis proposes a context-based OOV post-processing method. The method first constructs multiple OOV candidate words for each unk, extracts context features from several perspectives for each candidate, and then uses a pairwise learning-to-rank model to select the most suitable OOV word to replace the unk in the translation output. Experimental results show that this method significantly improves OOV recall in the translation output. Second, in the "preprocessing" stage, to address the ambiguity caused by OOV words in NMT, this thesis tries representing OOV words with semantic units at two different granularities: similar words and cluster information. During preprocessing, OOV words in the NMT training and test corpora are replaced with these semantic representations; the replaced corpora are used for NMT training and testing respectively, and after testing the previously replaced words are restored in the translation output. Experimental results show that preprocessing OOV words with word classes clearly improves translation quality. Finally, in the "in-model" stage, this thesis proposes a hierarchical clustering word embedding method for OOV words. Clustering is used to build a hierarchical semantic representation for OOV words, which is embedded into the NMT model. This hierarchical structure not only disambiguates OOV words on the source side but also lets the model use NMT context information to select translations for unk on the target side. The introduced cluster vectors also alleviate the sparsity problem of OOV words. Experimental results show that the model improves over the baseline by 1.43 to 2.06 BLEU points on Chinese-English translation tasks.
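The preprocessing-stage idea in the abstract (replacing OOV words with a coarser semantic unit before training and testing) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `build_vocab` and `replace_oov`, the class-token format `<C17>`, and the toy `word_to_class` mapping are all assumptions made for illustration, and how the thesis obtains its clusters or restores words after translation is not specified in the abstract.

```python
from collections import Counter

UNK = "unk"  # fallback symbol for OOV words that have no class


def build_vocab(sentences, max_size):
    """Keep only the max_size most frequent tokens as the NMT vocabulary."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_oov(sentence, vocab, word_to_class):
    """Replace each OOV token with its class token, or unk if it has no class.

    Returns the rewritten sentence plus a record of (position, original token)
    pairs, so the replaced words can be looked up again after translation.
    """
    replaced, record = [], []
    for i, tok in enumerate(sentence):
        if tok in vocab:
            replaced.append(tok)
        else:
            replaced.append(word_to_class.get(tok, UNK))
            record.append((i, tok))
    return replaced, record


# Toy data: a tiny corpus and a hypothetical word-to-cluster mapping.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "ocelot", "sat"]]
vocab = build_vocab(corpus, max_size=2)
word_to_class = {"cat": "<C17>", "dog": "<C17>", "ocelot": "<C17>"}

print(replace_oov(["the", "ocelot", "sat"], vocab, word_to_class))
```

In this sketch, rare words that share a cluster map to the same class token, so the model sees a more informative symbol than a single undifferentiated unk; the recorded positions are what a later restoration step would consult.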
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
