Research on the Construction and Evaluation of Comparable Corpora in Specialized Domains

Published: 2019-01-02 15:28

【Abstract】: Multilingual resources such as bilingual dictionaries and parallel corpora are the basic resources for overcoming cross-language barriers and for multilingual information processing and services, yet in some domains and languages they are scarce and subject to an acquisition bottleneck. By contrast, comparable corpora do not share the drawback of parallel corpora, in which the translation is constrained by the source text; they are easy to obtain, and the bilingual word pairs extracted from them can be used to extend bilingual dictionaries. Research on constructing comparable corpora is therefore worthwhile: it can enrich the theory of corpus construction, and it can supply rich, usable multilingual corpus resources for multilingual information processing. Existing work on comparable corpus construction mainly targets general domains such as news, but practical applications also urgently need comparable corpora in specialized domains. Moreover, because specialized-domain and general-domain corpora differ in many respects, the construction and evaluation methods developed for general domains are not necessarily applicable to specialized domains. This thesis therefore studies the construction and evaluation of specialized-domain comparable corpora. It explores methods for collecting Chinese-English domain comparable corpora, studies the measurement of corpus comparability by introducing a topic dimension on top of cross-language similarity, and finally assesses corpus quality comprehensively through internal and external evaluation. For corpus collection, the thesis builds specialized-domain comparable corpora from three types of Internet resources (Web search engines, online encyclopedias, and Chinese and English academic databases) and compares the resulting methods. For comparability measurement, the thesis takes the word as the unit and measures the comparability of whole corpora with three methods: sequence similarity based on traditional statistics (the chi-square statistic and the Spearman coefficient), sequence similarity based on word-frequency ranking, and sequence similarity based on termhood ranking. Experiments are run on corpora of different types (parallel, comparable, non-comparable, and so on). The results show that the termhood-ranking method performs best, followed by the word-frequency method, while the traditional statistical method performs worst. In addition, most studies of comparable corpora rely on a single indicator, and no reasonably complete and unified evaluation framework has yet emerged, so the evaluation of comparable corpora needs deeper study. Accordingly, the thesis evaluates the corpora from two sides. The internal evaluation assesses the internal consistency of a corpus on the basis of its overall lexical characteristics and the similarity of its sub-corpora; the external evaluation assesses corpus quality indirectly through a bilingual term extraction task. Bilingual term extraction experiments on corpora of different degrees of comparability (parallel, comparable, and non-comparable) show that corpora with higher comparability yield higher-quality terms.
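As an illustration of the rank-based comparability measures named in the abstract, the following is a minimal sketch and not the thesis's implementation. It assumes both corpora are already tokenized, that a small seed dictionary maps Chinese words to English translations, and that comparability is approximated by the Spearman rank correlation between the frequency ranks of the paired words. The function names (`average_ranks`, `pearson`, `spearman_comparability`) and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values in descending order (rank 1 = largest), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman_comparability(zh_tokens, en_tokens, seed_dict):
    """Spearman correlation between frequency ranks of dictionary-paired words.

    seed_dict is an assumed one-to-one Chinese -> English mapping.
    """
    zh_freq, en_freq = Counter(zh_tokens), Counter(en_tokens)
    # Keep only pairs that occur in at least one of the two corpora.
    pairs = [(zh, en) for zh, en in seed_dict.items()
             if zh_freq[zh] > 0 or en_freq[en] > 0]
    if len(pairs) < 2:
        return 0.0
    zh_ranks = average_ranks([zh_freq[zh] for zh, _ in pairs])
    en_ranks = average_ranks([en_freq[en] for _, en in pairs])
    # Spearman's coefficient is the Pearson correlation of the rank vectors.
    return pearson(zh_ranks, en_ranks)


# Toy example: the paired words follow the same frequency order in both corpora.
seed = {"语料": "corpus", "词典": "dictionary", "术语": "term"}
zh = ["语料"] * 3 + ["词典"] * 2 + ["术语"]
en = ["corpus"] * 5 + ["dictionary"] * 3 + ["term"]
score = spearman_comparability(zh, en, seed)
```

A score near 1 means the translated words are ranked alike in both corpora, while a score near -1 means the rankings are reversed. Replacing raw frequency with a termhood score before ranking would give a variant in the spirit of the termhood-ranking method the abstract describes, though the exact termhood formula is not given here.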
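【Degree-granting institution】: Nanjing University of Science and Technology
【Degree level】: Master's
【Year degree conferred】: 2012
【Classification number】: TP391.1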

【References】