Research on the Recognition and Translation of Japanese-Chinese Numerical and Temporal Expressions

Published: 2018-10-18 09:01

[Abstract]: Named entity recognition and translation are fundamental tasks in natural language processing. Numerical and temporal expressions are a special class of named entities that carry key information, so recognizing and translating them has both theoretical and practical value. Their recognition and analysis underpin tasks such as information retrieval, event extraction, event detection and tracking, and question answering. In multilingual tasks such as machine translation in particular, the alignment and translation quality of numerical and temporal expressions are important factors in system performance. Research on their recognition and translation therefore matters for improving machine translation systems and for advancing artificial intelligence. Starting from the characteristics of Japanese and Chinese numerical and temporal expressions, this thesis combines linguistic knowledge with statistical methods and, through extensive data analysis and experiments, studies methods for recognizing and translating these expressions in both languages and applies them to a machine translation system. The main work is as follows. (1) Based on the latest TIMEX3 temporal annotation specification and a commonly used classification of numerals, and taking into account the isomorphic and heteromorphic cases described in Japanese and Chinese linguistics, keyword knowledge bases of trigger words, boundary words, and similar cues are built separately for Japanese and for Chinese. Words expressing approximate numbers are included in the recognition scope, so that the recognized expressions carry richer meaning. The expressions are then recognized by regular-expression matching. Finally, the rule-based method is fused with a statistical method to recognize numerical and temporal expressions in Japanese and in Chinese. Experiments show that the method performs well on both languages. (2) Bilingual alignment of numerical and temporal expressions is incorporated into conventional word alignment, and a bidirectional alignment algorithm that combines position constraints with a similarity measure is proposed. Experiments show that the algorithm effectively improves bilingual word alignment and helps the machine translation system train a better translation model. (3) Based on how these expressions are translated between Japanese and Chinese, a translation rule base is built specifically for translating them independently of the rest of the sentence. The recognition and alignment information and the rule base are integrated into an existing statistical machine translation system, which improves the translation accuracy of the expressions and their neighboring words and, in turn, overall translation quality; this is verified experimentally. In summary, the contributions of this thesis are: designing recognition and translation rules for temporal words on the basis of TIMEX3 annotation and the characteristics of Japanese and Chinese numerical and temporal expressions, and bringing approximate-number words into the recognition scope; proposing a bidirectional alignment algorithm for numerical and temporal expressions based on position constraints and a similarity measure; and building a Japanese-Chinese translation rule base for these expressions. The three components are applied to a machine translation system, and experiments confirm that they effectively improve its overall performance.
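
The regular-expression matching step described in item (1) of the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the word lists (`ZH_UNITS`, `JA_UNITS`, and the approximate-number lists), the helper names, and the pattern shape are hypothetical and are not taken from the thesis, whose actual knowledge bases and rules are not given here. The sketch covers only the rule-based part and omits the statistical component.

```python
import re

# Hypothetical mini knowledge base (NOT the thesis's actual lists).
# Characters that may form a number: ASCII digits, full-width digits,
# and common Chinese/Japanese numerals.
DIGITS = r"[0-9０-９〇零一二三四五六七八九十百千万两]+"

# Unit ("trigger") words that turn a number into a time expression.
ZH_UNITS = ["年", "月", "日", "号", "点", "分钟", "分", "秒", "小时", "天", "周"]
JA_UNITS = ["年", "月", "日", "時間", "時", "分", "秒", "週間"]

# Approximate-number words placed before or after the expression.
ZH_APPROX_PRE = ["大约", "约", "近"]
ZH_APPROX_POST = ["左右", "前后", "多"]
JA_APPROX_PRE = ["約", "およそ", "ほぼ"]
JA_APPROX_POST = ["ごろ", "頃", "くらい", "ぐらい", "前後"]


def alternation(words):
    """Join words into a regex alternation, longest first so that
    e.g. '時間' is tried before '時'."""
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(w) for w in ordered)


def build_pattern(units, approx_pre, approx_post):
    """Optional approximate prefix + one or more number-unit pairs
    + optional approximate suffix."""
    core = rf"(?:{DIGITS}(?:{alternation(units)}))+"
    return re.compile(
        rf"(?:{alternation(approx_pre)})?{core}(?:{alternation(approx_post)})?"
    )


ZH_PATTERN = build_pattern(ZH_UNITS, ZH_APPROX_PRE, ZH_APPROX_POST)
JA_PATTERN = build_pattern(JA_UNITS, JA_APPROX_PRE, JA_APPROX_POST)


def find_time_expressions(text, pattern):
    """Return every substring of text matched by the given pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    zh = "会议在2017年6月5日下午3点左右开始，持续了大约两小时。"
    ja = "会議は2017年6月5日の午後3時ごろに始まり、約2時間続いた。"
    print(find_time_expressions(zh, ZH_PATTERN))
    print(find_time_expressions(ja, JA_PATTERN))
```

On the two sample sentences, the patterns pick up the full date, the clock time with its trailing approximate word, and the duration with its leading approximate word. A purely lexical pattern like this will also produce false positives (for example, a numeral character followed by a unit character inside an unrelated word), which is one reason a rule-based recognizer would be combined with a statistical one.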
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1

[References]