
Research on Chinese-Lao Bilingual Sentence Alignment Methods

Published: 2018-05-17 08:03

  Topic: Chinese-Lao + sentence alignment; Source: Kunming University of Science and Technology, master's thesis, 2017


[Abstract]: A bilingual corpus stores semantically equivalent text in two languages. It is a basic resource for bilingual language processing and is widely used in machine translation, cross-language information retrieval, word sense disambiguation, and translation knowledge extraction. Alignment is the core step in processing bilingual text, and its quality directly affects the natural language processing work that builds on it. Sentence alignment is text alignment at the sentence level: a technique for finding semantically matching sentence pairs in a bilingual corpus. Based on the linguistic characteristics of Chinese and Lao, this thesis studies how to build a Chinese-Lao parallel corpus, how to select high-quality Chinese-Lao text features, and how to extract Chinese-Lao parallel sentence pairs using multiple features. The main work is as follows.
(1) The thesis examines how to build a bilingual parallel corpus, analyzes the distribution of parallel text on multilingual platforms, chiefly Wikipedia, and sets out a construction strategy for a Chinese-Lao parallel corpus covering bilingual text crawling, body-text extraction, and sentence alignment.
(2) By analyzing the linguistic characteristics of Lao and summarizing the similarities and differences between Chinese and Lao syntactic structure, the thesis selects a set of Chinese-Lao text features, including dictionary-matching features, word co-occurrence rate features, and numeric features, in preparation for parallel sentence pair extraction.
(3) The thesis proposes a multi-feature method for extracting Chinese-Lao parallel sentence pairs. The bilingual text obtained from multilingual platforms, chiefly Wikipedia, is first preprocessed. A candidate sentence pair extraction method then produces a set of candidate parallel sentence pairs, and the text features above are combined to train a support vector machine model and a maximum entropy model. Finally, experiments compare the extraction performance of the two classifiers and the effect of each text feature on alignment quality. They show that the support vector machine is better suited to this method and that the combination of all text features reaches an accuracy of 70.46%, a feasible and effective result for Chinese-Lao parallel sentence pair extraction.
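[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1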
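
Two of the feature types named in the abstract, dictionary matching and numeric features, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the feature formulas, so the definitions here (fraction of Chinese tokens with a dictionary translation on the Lao side, and Jaccard overlap of numeral strings) are assumptions, as are the names `TOY_DICT`, `dictionary_match`, `number_match`, and `feature_vector`. Both sentences are assumed to be already tokenized and to write numbers in Arabic numerals. The word co-occurrence rate feature is omitted because the abstract does not define it.

```python
import re

# Hypothetical toy lexicon (Chinese word -> set of Lao translations).
# A real system would load a full Chinese-Lao dictionary instead.
TOY_DICT = {
    "老挝": {"ລາວ"},
    "人口": {"ປະຊາກອນ"},
}

# Assumption: numbers appear as ASCII digits on both sides.
NUMBER_RE = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def dictionary_match(zh_tokens, lo_tokens, lexicon):
    """Fraction of Chinese tokens that have a lexicon translation among the Lao tokens."""
    if not zh_tokens:
        return 0.0
    lo_set = set(lo_tokens)
    hits = sum(1 for word in zh_tokens if lexicon.get(word, set()) & lo_set)
    return hits / len(zh_tokens)


def number_match(zh_text, lo_text):
    """Jaccard overlap of the numeral strings found in the two sentences."""
    zh_nums = set(NUMBER_RE.findall(zh_text))
    lo_nums = set(NUMBER_RE.findall(lo_text))
    if not zh_nums and not lo_nums:
        return 1.0  # assumed convention: no numbers on either side counts as agreement
    return len(zh_nums & lo_nums) / len(zh_nums | lo_nums)


def feature_vector(zh_tokens, lo_tokens, lexicon):
    """Feature values for one candidate sentence pair, ready to feed to a classifier."""
    return [
        dictionary_match(zh_tokens, lo_tokens, lexicon),
        number_match(" ".join(zh_tokens), " ".join(lo_tokens)),
    ]


if __name__ == "__main__":
    zh = ["老挝", "人口", "约", "700", "万"]
    lo = ["ລາວ", "ປະຊາກອນ", "700"]  # toy tokens, not a real translation
    print(feature_vector(zh, lo, TOY_DICT))
```

Vectors of this kind, computed for each candidate pair and labeled parallel or non-parallel, are the sort of input a support vector machine or maximum entropy classifier would be trained on.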



