Research on Web-Based Construction of Large-Scale Parallel Corpora
Published: 2018-05-28 12:54
Topic: Web information mining + bilingual parallel corpora; Source: master's thesis, Soochow University, 2012
【Abstract】: Large-scale parallel corpora are a key resource for natural language processing applications such as machine translation and cross-language information retrieval. The Internet holds vast multilingual parallel resources, and much previous work has focused on mining multilingual websites for pairs of parallel (i.e., mutually translated) monolingual pages, from which parallel corpora are then extracted. Although many institutions have undertaken the construction of bilingual parallel corpora, existing corpora still fall short of what real-text processing demands in quantity, quality, and domain coverage. Researchers have since observed that bilingual parallel resources on the Web exist not only in pairs of parallel monolingual pages but also within bilingual mixed pages, and that the parallel resources inside mixed pages offer higher translation quality, larger data volume, and broader domain coverage. This thesis therefore starts from bilingual mixed pages and studies how to automatically build a large-scale bilingual parallel corpus. The main contributions are summarized as follows:

(1) Exploring Web-based acquisition of bilingual mixed pages

The Internet indexes an enormous number of pages, and accurately retrieving the bilingual mixed ones among them is a challenging task. Earlier studies restricted the target sources: a large set of source sites (e.g., English-learning or translation websites) was collected in advance, and all of their internal pages were recursively downloaded as candidate bilingual mixed pages. But source-site selection in that approach requires manual intervention, and the number of pages obtained is limited. To overcome these drawbacks, some later studies proposed using search engines and heuristic cues to screen candidate source sites automatically, yet the resulting candidates are of uneven quality and many noisy pages get downloaded. This thesis proposes a method that uses a search engine together with an already-acquired small parallel corpus to recursively discover and fetch bilingual mixed pages; experiments show that the method acquires high-quality bilingual mixed pages quickly, accurately, and over sustained crawls. A sketch of this discovery loop follows.
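A minimal Python sketch of such a discovery loop, under assumptions the abstract does not spell out: `web_search` stands in for whatever search-engine API is used, `extract_pairs` is a placeholder for the extractor described under contribution (2), and the bilingual check is a simple script-counting heuristic rather than the thesis's actual criteria.

```python
import re
import urllib.request

def web_search(query, top_k=20):
    """Placeholder: call a real search engine API and return result URLs."""
    raise NotImplementedError("plug in a search API here")

def fetch(url):
    """Download a page; return its text, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="ignore")
    except OSError:
        return None

def looks_bilingual(html, min_cjk=50, min_latin=50):
    """Cheap heuristic: a mixed page carries substantial text in both scripts."""
    cjk = len(re.findall(r"[\u4e00-\u9fff]", html))
    latin = len(re.findall(r"[A-Za-z]", html))
    return cjk >= min_cjk and latin >= min_latin

def extract_pairs(html):
    """Placeholder for the pattern-learning extractor (contribution 2)."""
    return []

def discover(seed_pairs, rounds=3):
    """Recursively grow the corpus: query with fragments of known pairs
    (both languages in one query favors mixed pages), keep pages that
    pass the bilingual heuristic, and harvest new pairs from them."""
    corpus, seen = list(seed_pairs), set()
    for _ in range(rounds):
        new_pairs = []
        for zh, en in corpus[-100:]:            # query with recent pairs
            query = f'"{zh[:20]}" "{en[:40]}"'
            for url in web_search(query):
                if url in seen:
                    continue
                seen.add(url)
                html = fetch(url)
                if html and looks_bilingual(html):
                    new_pairs.extend(extract_pairs(html))
        corpus.extend(new_pairs)
    return corpus
```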
(2) Improved extraction and alignment of bilingual parallel resources

Bilingual mixed pages contain not only useful parallel resources but also noise such as advertisements and navigation elements, and the parallel resources appear in many different forms, all of which complicates extraction. Moreover, the vocabulary of the parallel resources far exceeds the coverage of bilingual dictionaries, which makes alignment harder still. This thesis extracts parallel resources by automatically learning the forms in which they appear within a page, and improves corpus quality using methods based on sentence length, bilingual dictionaries, and translation models. Sketches of both steps follow.
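For extraction, a hedged illustration of one "existence form" of the kind the thesis learns automatically: translation pairs laid out as alternating Chinese/English text lines. The form is hard-coded here purely for illustration, and the heuristics (script counting, minimum run length, crude tag stripping) are assumptions, not the thesis's learned patterns.

```python
import re

def classify(line):
    """Label a text line by its dominant script: 'zh', 'en', or None."""
    cjk = len(re.findall(r"[\u4e00-\u9fff]", line))
    latin = len(re.findall(r"[A-Za-z]", line))
    if cjk > latin and cjk >= 4:
        return "zh"
    if latin > cjk and latin >= 8:
        return "en"
    return None

def strip_tags(html):
    """Crude tag removal; a real system would walk the DOM instead."""
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                  flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", "\n", text)
    return [ln.strip() for ln in text.splitlines() if ln.strip()]

def extract_pairs(html, min_run=2):
    """Keep (zh, en) line pairs only where the alternating layout repeats,
    which screens out isolated coincidences such as navigation text."""
    lines = strip_tags(html)
    labels = [classify(ln) for ln in lines]
    pairs, run, i = [], [], 0
    while i < len(lines) - 1:
        if labels[i] == "zh" and labels[i + 1] == "en":
            run.append((lines[i], lines[i + 1]))
            i += 2
        else:
            if len(run) >= min_run:
                pairs.extend(run)
            run, i = [], i + 1
    if len(run) >= min_run:
        pairs.extend(run)
    return pairs
```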
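And a sketch of the quality filter combining the three signals the abstract names: a length-ratio score, bilingual-dictionary coverage, and a lexical translation-model score in the spirit of IBM Model 1. The toy `LEXICON`/`T_TABLE`, the weights, and the threshold are illustrative assumptions, not the thesis's parameters.

```python
import math

# Toy resources; a real system would load a full bilingual dictionary
# and a translation table trained on the growing corpus.
LEXICON = {"corpus": {"语料库"}, "translation": {"翻译"}}
T_TABLE = {("corpus", "语"): 0.3, ("corpus", "料"): 0.3,
           ("translation", "译"): 0.4}

def length_score(zh, en, mean=1.8, sigma=0.8):
    """Length signal: Chinese characters per English word clusters
    around a stable ratio for genuine translation pairs."""
    ratio = len(zh) / max(len(en.split()), 1)
    return math.exp(-((ratio - mean) ** 2) / (2 * sigma ** 2))

def dict_score(zh, en):
    """Dictionary signal: fraction of English words whose listed
    translation shows up on the Chinese side."""
    words = [w.strip(".,!?;:").lower() for w in en.split()]
    hits = sum(1 for w in words if any(t in zh for t in LEXICON.get(w, ())))
    return hits / max(len(words), 1)

def tm_score(zh, en):
    """Translation-model signal: average best character-level lexical
    probability, IBM-Model-1 style."""
    words = [w.strip(".,!?;:").lower() for w in en.split()]
    best = [max((T_TABLE.get((w, c), 0.0) for c in zh), default=0.0)
            for w in words]
    return sum(best) / max(len(best), 1)

def keep(zh, en, weights=(0.3, 0.35, 0.35), threshold=0.4):
    """Accept a candidate pair when the weighted combination of the
    three signals clears an (assumed) threshold."""
    score = (weights[0] * length_score(zh, en)
             + weights[1] * dict_score(zh, en)
             + weights[2] * tm_score(zh, en))
    return score >= threshold

# e.g., filter the extractor's output:
# clean = [(zh, en) for zh, en in pairs if keep(zh, en)]
```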
【Degree-granting institution】: Soochow University
【Degree level】: Master's
【Year of award】: 2012
【CLC number】: TP393.09