Construction and Statistical Analysis of a Chinese Semantic Role Labeling Corpus

Published: 2018-01-11 08:11

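Keywords: construction and statistical analysis of a Chinese semantic role labeling corpus　Source: Ludong University, 2017 master's thesis　Document type: degree thesis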

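【Abstract】: With the rapid development of information technology, natural language processing has an ever greater influence on everyday life. Within natural language processing, how to make computers understand human language, and thereby enable human-computer interaction, is an important problem that urgently needs solving. Automatic word segmentation and part-of-speech tagging for Chinese already achieve fairly high accuracy using lower-level linguistic knowledge and certain statistical methods, but some ambiguities remain beyond their reach and can only be fully resolved at the stages of syntactic and semantic analysis. For natural language understanding, syntactic analysis is only one of the means; semantic analysis is the key and the hard part, and without its support automatic syntactic analysis also struggles. In the pursuit of artificial intelligence, semantic analysis has become more important and urgent than ever: for a natural language processing system to combine the speed of a computer with human intelligence, semantic analysis of some depth is indispensable.

Building on an existing syntactic treebank, this thesis constructs a semantic role labeling corpus of a certain scale. First, the corpus was annotated with semantic roles according to the HowNet case frame dictionary and the Specification for the Modern Chinese Predicate Semantic Role Labeling Corpus (《现代汉语谓词语义角色标注语料库规范》), in two main stages: manual annotation and manual proofreading. Second, on the basis of the manual annotation, the annotation scheme was revised and improved, and a set of semantic role labeling rules was derived and tested for validity. Finally, the research content and results are summarized. The thesis has six parts, outlined below.

Part one, Introduction. This part presents the theoretical background, the state of research, the research methods, and the significance of the study. The theoretical background covers valency theory, argument theory, and semantic roles. The state of research is reviewed from three angles: types of semantic role relations, the construction of semantic role corpora, and semantic role labeling schemes. Methodologically, the thesis relies on corpus-based methods, human-machine cooperation, a combination of rule-based and statistical approaches, and a combination of qualitative and quantitative analysis. The study aims to give consistent annotations to sentences that differ in syntactic structure but share the same basic logical semantics, and to build a semantic role labeling corpus of a certain scale, thereby contributing to semantic analysis and natural language understanding.

Chapter 1, The semantic role labeling corpus. This chapter describes the groundwork: the source and size of the corpus, the construction of the earlier syntactic treebank, the types of semantic role relations and the HowNet case frame dictionary, the annotation platform, and the annotation scheme. The corpus is drawn from the People's Daily (《人民日报》) and contains 40,000 sentences in total. It is built on top of an earlier dependency treebank and represents a further stage of processing. The annotation platform was adapted from the treebank's syntactic annotation platform and can switch between syntactic annotation and semantic role annotation. The types of semantic role relations and the annotation scheme follow the Specification for the Modern Chinese Predicate Semantic Role Labeling Corpus; unlike that specification, however, this thesis uses the HowNet case frame dictionary to assist annotation, which helps ensure objectivity and accuracy.

Chapter 2, Common problems in semantic role annotation and how they are handled. This chapter summarizes the problems encountered during manual annotation and proposes solutions. The problems fall into three kinds: missed labels, superfluous labels, and wrong labels. Each kind is analyzed from two angles: the annotation of predicate components and the annotation of predicate arguments. The proposed solutions include correctly attaching a word to a synonym and choosing the verb sense according to context.

Chapter 3, Problems in the case frame dictionary and countermeasures. Based on the manual annotation, this chapter summarizes the problems found in the verb senses and case frames of the dictionary, analyzes their causes, and proposes countermeasures. The problems fall into four groups: incorrect case frames for a verb semantic class, verbs assigned an incorrect semantic class, verbs assigned an incomplete set of semantic classes, and out-of-vocabulary words. Incorrect case frames cover two cases: frames with incomplete semantic roles and frames whose required roles are set wrongly. Incomplete semantic classes also cover two cases: the semantic classes of a verb are not comprehensively listed, and the case frame of a semantic class does not fully fit every sense within it. The causes are discussed in detail from several angles: the design of the case frame dictionary, the evolution of word meaning, differences among verb senses within the same semantic class, and the emergence of new words. Two remedies are proposed: testing case frames by sentence-pattern transformation, and attachment to near-synonyms. Where the case frame of a semantic class is incorrect, or does not fit all verbs in the class, the frame is verified by sentence-pattern transformation; the remaining problems are corrected by attachment to near-synonyms.

Chapter 4, The correspondence between sentence patterns and sentence models, and the semantic role labeling rules. From the results of manual annotation and proofreading, the typical sentence model of each sentence pattern is derived by introspection. The patterns are mainly subject-predicate sentences, comprising verb-predicate, noun-predicate, and adjective-predicate sentences. Verb-predicate sentences include general verb-predicate sentences, 把 (ba) sentences, 被 (bei) sentences, pivotal sentences, serial-verb sentences, double-object sentences, and 比 (bi) comparative sentences. Next, the typical sentence models are organized according to whether the pattern carries an overt marker, yielding a set of semantic role labeling rules. Finally, the rules are tested for validity on a test set, the cases they fail to cover are summarized, and strategies for handling them are proposed. Provided the rules prove sufficiently valid, they can be applied to later annotation work. On the one hand, this exploits the high accuracy of rule-based methods and reduces the manual workload; on the other hand, the rules can automatically detect errors in purely manual annotation and so improve labeling accuracy.

Final part, Conclusion. This part summarizes the main research content and results, states the significance of the work for Chinese information processing and for research on Chinese grammar and semantics, and finally analyzes the shortcomings of the study and plans the next steps.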
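
The dictionary-assisted checking described in the abstract can be illustrated with a small sketch. It assumes a case frame lists required and optional roles for a verb sense and compares an annotator's labels against it. The `CaseFrame` class, the `FRAMES` entries, the role names, and `check_annotation` are all hypothetical and invented for illustration; they are not taken from the HowNet case frame dictionary or from the thesis.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CaseFrame:
    """Hypothetical case frame for one verb sense."""
    required: frozenset
    optional: frozenset = field(default_factory=frozenset)


# Toy frames invented for this example; not real dictionary entries.
FRAMES = {
    "give": CaseFrame(
        required=frozenset({"agent", "patient", "recipient"}),
        optional=frozenset({"time", "location"}),
    ),
    "sleep": CaseFrame(
        required=frozenset({"agent"}),
        optional=frozenset({"time", "location"}),
    ),
}


def check_annotation(verb, labeled_roles):
    """Compare the roles labeled for a predicate with its case frame.

    Returns a dict with:
      unknown_verb: True if the verb has no frame (out-of-vocabulary)
      missing:      required roles that were not labeled
      unlicensed:   labeled roles the frame does not allow
    """
    frame = FRAMES.get(verb)
    if frame is None:
        return {"unknown_verb": True, "missing": [], "unlicensed": []}
    labeled = set(labeled_roles)
    missing = sorted(frame.required - labeled)
    unlicensed = sorted(labeled - frame.required - frame.optional)
    return {"unknown_verb": False, "missing": missing, "unlicensed": unlicensed}


if __name__ == "__main__":
    print(check_annotation("give", ["agent", "patient"]))
    print(check_annotation("sleep", ["agent", "patient"]))
    print(check_annotation("teleport", ["agent"]))
```

In a sketch like this, a flagged item is only a prompt for review. A missing required role may be an annotation omission, but it may equally mean that the frame's required roles are set wrongly, which is one of the dictionary problems the abstract describes.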

【Degree-granting institution】: Ludong University
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: H146.3
