
Research on Term Relation Extraction Methods Based on Syntactic Structure

Published: 2018-07-13 09:14
[Abstract]: Data on the Internet is growing exponentially, and converting this massive, content-rich and varied data into knowledge that can be stored and represented effectively is of great importance. At the same time, as natural language processing technology matures, it has become feasible to extract useful information from open-domain Web text and use it to build knowledge graphs. Terms are relatively fixed words or phrases used in a specific scientific field. They label the things, phenomena, properties, relations and processes of each specialized field, and they are a powerful tool for scientific research and knowledge exchange. Term relations embody the core knowledge of a domain and have theoretical and practical value for understanding and learning domain knowledge and for predicting future trends. They can also be applied widely in information retrieval, automatic question answering and knowledge graph construction. However, manually extracting term relations from large-scale corpora is time-consuming and laborious, so automatic or semi-automatic term relation extraction has become an active research topic. This thesis studies the acquisition of open-domain term relations, proposes a term relation extraction method based on syntactic structure, and builds a medical-domain knowledge graph on that basis. The main contributions are summarized as follows.

(1) A high-precision bootstrapping method for acquiring term relation templates is proposed. When templates are used for relation extraction, the quality of the templates directly determines the quality of the extraction results. The method exploits the diversity of Web data in bootstrapping iterations to expand a small seed set of terms into a large term relation base. Word vectors are trained with the deep learning tool word2vec and used to compute semantic similarity. Candidate term relations are ranked by similarity, and the most similar ones are selected as new seeds, which to some extent avoids the semantic drift of traditional bootstrapping methods.

(2) A term relation extraction method based on dependency syntactic structure is proposed. Using dependency parsing and semantic role labeling, the method prunes the dependency tree of a sentence down to its minimal subtree and extracts the verb-centered sentence trunk that carries the semantic dependency relations. The trunk covers the key information of the term relation while reducing the noise introduced by dependency parsing errors. The templates are then generalized, relation categories are labeled according to the core verb combined with discourse analysis of the text, and the results are stored in a database in structured form for fast querying. Experiments show that relation extraction based on syntactic structure can effectively use structural features to capture the semantic relations between terms.

(3) A method for constructing a knowledge graph of multiple types of term relations is proposed. A knowledge graph describes the concepts, entities and events of the objective world, and the relations among them, in a structured form, turning information into a form close to how humans understand the world. Taking a medical knowledge graph as a case study, the thesis uses knowledge integration to address the dispersion, heterogeneity, redundancy and fragmentation of knowledge in medical data, and provides technical support for further machine understanding of natural language. To verify the effectiveness of the proposed methods, an instance of a medical-domain knowledge graph was constructed.
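The seed-selection step of contribution (1) can be illustrated with a minimal sketch. The abstract does not say how a term relation is represented as a vector or how many candidates are promoted per iteration, so everything below is an assumption made for illustration: the hand-written three-dimensional vectors stand in for trained word2vec embeddings, a term pair is represented by concatenating its two term vectors, and the names `cosine`, `pair_vector` and `select_new_seeds` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pair_vector(pair, vectors):
    """Represent a (head term, tail term) pair by concatenating the two
    term vectors. This representation is an assumption of the sketch."""
    head, tail = pair
    return vectors[head] + vectors[tail]

def select_new_seeds(seeds, candidates, vectors, top_k=1):
    """Rank candidate pairs by their best similarity to any current seed
    and keep the top_k, so that unrelated pairs are not promoted."""
    def score(candidate):
        cand_vec = pair_vector(candidate, vectors)
        return max(cosine(cand_vec, pair_vector(s, vectors)) for s in seeds)
    return sorted(candidates, key=score, reverse=True)[:top_k]

# Toy vectors standing in for word2vec embeddings (hypothetical values).
vectors = {
    "aspirin":   [0.9, 0.1, 0.0],
    "headache":  [0.1, 0.9, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "fever":     [0.2, 0.8, 0.1],
    "Beijing":   [0.0, 0.1, 0.9],
    "China":     [0.1, 0.0, 0.9],
}
seeds = [("aspirin", "headache")]
candidates = [("Beijing", "China"), ("ibuprofen", "fever")]
new_seeds = select_new_seeds(seeds, candidates, vectors, top_k=1)
```

With these toy values the drug-symptom candidate is ranked above the city-country candidate, which is the kind of filtering that limits semantic drift across iterations.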
The experimental results show that the proposed term relation extraction method based on syntactic structure is highly practical and achieves a degree of automation in term relation extraction and knowledge graph construction.
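The minimal-subtree pruning described in contribution (2) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the dependency parse is hand-written rather than produced by a parser, the tree is encoded as a list of head indices with -1 for the root, and the function names `path_to_root` and `minimal_subtree` are hypothetical. The sketch keeps only the nodes on the paths from each term to their lowest common ancestor, which here is the core verb.

```python
def path_to_root(index, heads):
    """Return the node indices from `index` up to the root (inclusive)."""
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def minimal_subtree(term_a, term_b, heads):
    """Indices of the smallest connected subtree containing both terms:
    the two paths from each term up to their lowest common ancestor."""
    path_a = path_to_root(term_a, heads)
    path_b = path_to_root(term_b, heads)
    ancestors_b = set(path_b)
    # The first node on term_a's upward path that also lies on term_b's path.
    lca = next(node for node in path_a if node in ancestors_b)
    keep = set(path_a[:path_a.index(lca) + 1]) | set(path_b[:path_b.index(lca) + 1])
    return sorted(keep)

# Hand-written parse (an assumption): the verb "relieves" is the root.
tokens = ["Aspirin", "effectively", "relieves", "severe", "headache"]
heads = [2, 2, -1, 4, 2]  # heads[i] is the index of token i's head

# Modifiers outside the term-to-term path are pruned away.
trunk = [tokens[i] for i in minimal_subtree(0, 4, heads)]
```

For this sentence the trunk keeps the two terms and the verb that connects them, and drops the adverb and the adjective, so an error in attaching either modifier would not affect the extracted relation.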
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1


Article ID: 2118878



Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2118878.html

