Research on Chinese-Lao Bilingual Word Alignment and Dependency Treebank Construction Methods
Published: 2018-05-01 03:18
Topic: Chinese + Lao; Source: Kunming University of Science and Technology, 2017 master's thesis
[Abstract]: With the rapid development of technology and the economy and the steady deepening of cross-lingual communication, global interconnection has become an irresistible trend. The multilingual information on the Internet is enormous in volume and changes dynamically in real time, so relying on human translation alone to process it is unrealistic. The only practical solution is to make full use of machine translation to provide automatic translation services, and this has set off a wave of research in the field. Mutual communication and understanding through language are the foundation of economic and cultural exchange between countries, and China and Laos are no exception; in-depth study of the Chinese-Lao language pair also lays the groundwork for building Chinese-Lao bilingual corpus resources. In natural language processing, bilingual word alignment is an important foundational task: it treats the relationship between units that are translations of each other in a bilingual parallel corpus as a link, and these alignment links provide valuable reference knowledge for machine translation. Word alignment gives basic support to many applications, such as dependency treebank construction, bilingual dictionary compilation, machine translation, and bilingual information extraction. In-depth research on automatic Chinese-Lao word alignment, and the construction of a reasonably large bilingual parallel corpus on that basis, therefore plays a pivotal role in Chinese-Lao information processing. By analyzing the similarities and differences between the grammatical structures of Chinese and Lao, this thesis studies methods for automatic Chinese-Lao word alignment and for building a Lao dependency treebank from a Chinese-Lao word-aligned corpus. Its distinctive contributions are as follows:
(1) The differences between the grammatical characteristics of Chinese and Lao are analyzed first. The analysis shows that the order of modifiers and head words in sentence structure is reversed between the two languages. Starting from this observation, a set of bilingual features is selected to constrain Chinese-Lao word alignment.
(2) Syntactic features are incorporated into a statistical word alignment algorithm to constrain automatic Chinese-Lao word alignment. Chinese and Lao differ greatly in both grammar and syntactic structure, which makes automatic word alignment difficult, so this thesis proposes a Chinese-Lao automatic word alignment method that fuses multiple syntactic features. Syntactic features of the two languages are first analyzed and selected, then integrated into a single model that uses the log-linear framework and is trained with the minimum error rate training algorithm. Experiments use IBM Model 3 as the baseline. The results show that the proposed method achieves good alignment quality and clearly outperforms the baseline.
(3) A method is proposed for building a Lao dependency treebank from a Chinese-Lao word-aligned corpus. The preliminary literature survey found that relatively little research, in China or abroad, targets Lao and that no large-scale Lao dependency treebank has been built, while building one manually is very difficult. This thesis therefore proposes building the treebank with the help of a Chinese-Lao word-aligned corpus. Starting from an already obtained word-aligned parallel corpus, the Chinese sentences are first given a dependency parse. Then, taking the linguistic characteristics of Lao into account and following dependency syntax rules, the dependency relations of each Chinese sentence are mapped onto the Lao sentence through the word alignment links, which finally yields the dependency tree of the Lao sentence. In the experiments the method is compared with a traditional machine learning approach. The results show a clear improvement in accuracy. The method also simplifies the manual annotation and collection work involved in building a Lao dependency treebank, saves substantial manpower and resources, and makes it possible to build a good-quality Lao dependency treebank quickly when Lao corpora are scarce.
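The log-linear combination described in contribution (2) scores a candidate link between Chinese word i and Lao word j as a weighted sum of feature functions, score(i, j) = Σ_k λ_k · h_k(i, j). The following is a minimal sketch of that idea only. The abstract does not list the thesis's actual features, so the three features here (a lexical probability, a monotone-position feature, and a mirrored-position feature standing in for the reversed modifier-head order) are assumptions, as are the toy probability table `LEX_PROB`, the hand-set `WEIGHTS` (the thesis tunes its weights with minimum error rate training), and the greedy per-word link selection.

```python
import math

# Hypothetical toy lexical probabilities p(lao_word | zh_word); not from the thesis.
LEX_PROB = {
    ("红", "ແດງ"): 0.8,      # red
    ("花", "ດອກໄມ້"): 0.9,   # flower
    ("红", "ດອກໄມ້"): 0.05,
    ("花", "ແດງ"): 0.05,
}

# Hand-set feature weights (assumption); a real system would tune them, e.g. with MERT.
WEIGHTS = [1.0, 0.2, 0.5]


def features(zh, lao, i, j):
    """Return the feature vector h(i, j) for linking zh[i] to lao[j]."""
    # Lexical feature: log translation probability, with a small floor for unseen pairs.
    lex = math.log(LEX_PROB.get((zh[i], lao[j]), 1e-6))
    # Monotone-position feature: penalise links far from the diagonal.
    diag = -abs(i / len(zh) - j / len(lao))
    # Mirrored-position feature: a crude stand-in for the reversed modifier-head
    # order; it favours links whose relative positions are mirrored.
    mirror = -abs(i / len(zh) - (len(lao) - 1 - j) / len(lao))
    return [lex, diag, mirror]


def score(zh, lao, i, j, weights):
    """Log-linear score: weighted sum of the feature values."""
    return sum(w * h for w, h in zip(weights, features(zh, lao, i, j)))


def align(zh, lao, weights):
    """Greedily link each Lao word to its best-scoring Chinese word.

    Returns a list of (zh_index, lao_index) pairs, one per Lao word.
    """
    links = []
    for j in range(len(lao)):
        best_i = max(range(len(zh)), key=lambda i: score(zh, lao, i, j, weights))
        links.append((best_i, j))
    return links


zh_sentence = ["红", "花"]        # modifier before head
lao_sentence = ["ດອກໄມ້", "ແດງ"]  # head before modifier
alignment = align(zh_sentence, lao_sentence, WEIGHTS)
print(alignment)
```

On this toy pair the two links cross, which is the behaviour the reversed-order observation in contribution (1) predicts. A greedy choice per Lao word keeps the sketch short; it does not enforce one-to-one links or model fertility as IBM Model 3 does.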
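[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1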