Research on Named-Entity-Based Uyghur-Chinese Translation Rules and Resource Construction

Published: 2018-07-31 11:10

[Abstract]: As education has spread in Xinjiang's ethnic minority regions and educational attainment has gradually risen, demand for information media among Xinjiang's ethnic minorities has grown steadily, and the number of websites published in Uyghur script is increasing year by year. Xinjiang news websites mainly carry reports on public affairs such as politics, economics, military affairs, and diplomacy, as well as reports and commentary on social emergencies. Bilingual news media in Xinjiang (including government documents of various kinds) reportedly translate numeric expressions with low accuracy when the content involves finance, dates, or times. Yet given the sheer volume of information, obtaining accurate data is a problem for researchers and equally a need of government staff and readers looking up information. Correct translation of numeric phrases in web news and government documents is therefore an important part of statistical machine translation. Taking this as its starting point, the thesis carries out the following main work:
First, the Uyghur-Chinese parallel corpus required for the experiments is collected, then cleaned and processed. The corpus is mainly downloaded from Xinjiang news websites.
Second, named entities such as numbers, times, and dates are classified in detail. The categories are based on an analysis of how numeric and temporal expressions are formed in Uyghur and Chinese.
Third, rules for recognizing and translating Uyghur-Chinese numerals are written by hand. Writing rules for the numeric, time, and date expressions that occur in the corpus is the core of the thesis.
The contribution of the thesis is as follows. Influential online translation systems such as Baidu, Google, and Youdao are already available in China, but they only translate between major languages; they do not translate between minority languages and other languages, let alone translate Uyghur numeric phrases into Chinese. The thesis uses a rule-based method to translate numeric and time expressions from Uyghur into Chinese.
The experimental results show that hand-written rules for named entities such as numbers and times effectively improve the phrase translation probability table and thereby clearly improve translation quality. Future work will study how to make better use of rule-based methods within statistical machine translation, and will refine and extend the rules.
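As a rough illustration of what hand-written numeral and date rules of this kind can look like, the sketch below parses Uyghur number words and rewrites one date pattern into Chinese. It is not the thesis's rule set: the word list (an illustrative subset in Latin-script transliteration rather than Uyghur Arabic script), the date pattern, and the function names `parse_uyghur_number`, `to_chinese`, and `translate_date` are all assumptions made for this example.

```python
import re

# Assumed, illustrative subset of Uyghur number words (Latin transliteration).
UNITS = {"bir": 1, "ikki": 2, "üch": 3, "töt": 4, "besh": 5,
         "alte": 6, "yette": 7, "sekkiz": 8, "toqquz": 9}
TENS = {"on": 10, "yigirme": 20, "ottuz": 30, "qiriq": 40, "ellik": 50,
        "atmish": 60, "yetmish": 70, "seksen": 80, "toqsan": 90}
HUNDRED, THOUSAND = "yüz", "ming"

DIGITS = "零一二三四五六七八九"
PLACES = ["", "十", "百", "千"]


def parse_uyghur_number(text):
    """Turn a sequence of number words into an integer (hypothetical rule)."""
    total, current = 0, 0
    for tok in text.lower().split():
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == HUNDRED:
            # A bare "yüz" counts as one hundred.
            current = (current or 1) * 100
        elif tok == THOUSAND:
            # Close the current group and scale it by one thousand.
            total += (current or 1) * 1000
            current = 0
        else:
            raise ValueError(f"unknown number word: {tok}")
    return total + current


def to_chinese(n):
    """Render 0 <= n < 10000 as a Chinese numeral string."""
    if not 0 <= n < 10000:
        raise ValueError("this sketch only covers 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = str(n)
    out, pending_zero = "", False
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            pending_zero = True  # emit one 零 only if a nonzero digit follows
            continue
        if pending_zero:
            out += DIGITS[0]
            pending_zero = False
        out += DIGITS[d] + PLACES[len(digits) - 1 - i]
    # 10..19 are conventionally written 十, 十一, ... without the leading 一.
    return out[1:] if 10 <= n < 20 else out


# Assumed date pattern: "<year>-yili <month>-ayning <day>-küni".
DATE_RULE = re.compile(r"(\d{4})-yili\s+(\d{1,2})-ayning\s+(\d{1,2})-küni")


def translate_date(text):
    """Rewrite every matching date expression into Chinese year/month/day form."""
    return DATE_RULE.sub(r"\1年\2月\3日", text)
```

For example, `to_chinese(parse_uyghur_number("ikki ming on üch"))` goes through the integer 2013 before rendering it in Chinese. A real system would need far more patterns (ordinals, fractions, currency, clock times) and would operate on Uyghur script directly.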
[Degree-granting institution]: Northwest Minzu University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: H215;H085
