Semantic Annotation and Statistical Analysis of Chinese Sentences Based on AMR

Published: 2018-06-06 07:22

Topics: sentence semantics + semantic annotation; Source: Nanjing Normal University, 2017 master's thesis

【Abstract】: Semantic analysis has long been one of the hard problems in natural language processing. In the era of big data, machine-learning-based part-of-speech tagging and syntactic parsing have matured, and progress in machine translation, artificial intelligence, and related fields depends more and more on deep semantic analysis at the sentence level. AMR (Abstract Meaning Representation) is a sentence-level semantic representation whose output is a single-rooted directed acyclic graph. AMR represents the concepts in a sentence and the relations between them; in abstracting from words to concepts and relations, it can add concepts or drop words from the sentence as the meaning requires. Compared with other semantic representations, AMR can therefore capture the rich semantic information in a sentence more completely. However, AMR research has so far focused mainly on English, and its scheme is not directly suitable for representing the semantics of Chinese sentences.

For these reasons, this thesis takes the semantic representation of Chinese sentences as its research goal. After a detailed review of the development of AMR, its scheme, and automatic AMR parsing, it builds on AMR to establish an abstract meaning representation for Chinese sentences (Chinese AMR, CAMR). The CAMR annotation scheme consists of two parts: the inheritance and extension of AMR, and the CAMR annotation specification. The specification defines a detailed tag set and gives careful definitions for both common and special linguistic phenomena in Chinese. The tag set is divided into concepts and relations. The concept part specifies how to handle anaphora, mood, the various interrogative pronouns, quantity types, proper nouns, and so on, and also adds concepts that represent complex sentences. The relation part comprises 5 core semantic relations and 42 non-core semantic relations. Every rule in the specification is illustrated with a concrete Chinese example.

On the basis of this specification, the thesis carries out its second task, corpus annotation, in two stages. In the first stage, the Chinese edition of The Little Prince was annotated; during annotation, the tag set was repeatedly discussed and revised according to the actual needs of the analysis, and the CAMR specification was refined accordingly. In the second stage, after a careful comparison of several corpora, the Chinese Penn Treebank (CTB) was selected for annotation. In total, 1,562 sentences from The Little Prince and 5,000 sentences from CTB were annotated.

After annotation, the thesis presents statistical analyses of several characteristics of CAMR. First, because a CAMR analysis is a single-rooted directed acyclic graph, the corpus was examined for graph structure: 39.96% of the sentences are graphs rather than trees, which strongly supports the need for a graph structure to represent the semantics of Chinese sentences. Second, regarding CAMR's ability to add concepts and drop words: 95.2% of the sentences involve added concepts and 96.94% involve dropped words when represented in CAMR. This shows that abstraction through added concepts and dropped words is necessary for representing sentence meaning, and further supports the view that CAMR's inheritance of AMR's abstract approach is both reasonable and necessary. Finally, predicates have always been a focus of syntactic and semantic research, and in CAMR predicate senses are distinguished by their argument structures. The thesis therefore counts how arguments are used by predicate senses in the corpus and derives an argument lexicon of predicate senses, which is available to other linguistic researchers.
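The graph-structure statistic in the abstract (39.96% of sentences) can be illustrated with a small sketch. The abstract does not state the exact criterion the thesis used, so the code below assumes that a sentence counts as a "graph structure" when at least one node in its representation has more than one incoming edge (a reentrancy). The triple format and the names `has_reentrancy` and `graph_share` are hypothetical, chosen for illustration only, and the example uses the standard English AMR sentence "The boy wants to go" rather than data from the thesis.

```python
from collections import Counter
from typing import List, Tuple

# Assumed edge format: (parent variable, relation, child variable).
Triple = Tuple[str, str, str]


def has_reentrancy(edges: List[Triple]) -> bool:
    """Return True if any node has more than one incoming edge."""
    in_degree = Counter(child for _, _, child in edges)
    return any(count > 1 for count in in_degree.values())


def graph_share(corpus: List[List[Triple]]) -> float:
    """Fraction of sentences whose representation is not a tree."""
    if not corpus:
        return 0.0
    return sum(has_reentrancy(edges) for edges in corpus) / len(corpus)


# "The boy wants to go": boy (b) is ARG0 of both want-01 (w) and go-01 (g),
# so b has two parents and the structure is a graph, not a tree.
wants_to_go = [("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")]

# "The boy sleeps": every node has at most one parent, so this is a tree.
sleeps = [("s", "ARG0", "b")]

print(has_reentrancy(wants_to_go))
print(has_reentrancy(sleeps))
print(graph_share([wants_to_go, sleeps]))
```

Counting in-degrees is enough here because the representation is assumed to be single-rooted and acyclic, so any departure from a tree shows up as a node with several parents.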
【Degree-granting institution】: Nanjing Normal University
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: H146.3

【References】