Research on Several Key Problems in Web Information Extraction

Published: 2018-04-26 13:19

Topics: information extraction + named entity disambiguation; Source: University of Science and Technology of China, 2015 master's thesis

[Abstract]: In recent years, with the rapid development of Web applications, the information resources on the Internet have become increasingly abundant. Against this background, Web information extraction technology has emerged. Web information extraction is a processing technology for accurately obtaining the factual information users need from massive data. It involves many problems, including entity recognition and extraction, relation extraction, entity disambiguation, opinion mining, and sentiment orientation analysis, and it has become one of the research hotspots in the Web field. This thesis studies two key problems in Web information extraction: named entity disambiguation and sentiment orientation information extraction. Named entity disambiguation aims to resolve the ambiguity in which concept a named entity on the Web refers to, and thereby to determine the entity it actually denotes. In the Web environment a single named entity mention can correspond to several entity concepts; for example, the mention "Washington" can refer either to U.S. President George Washington or to the capital, Washington, D.C. Named entity disambiguation therefore has important practical value in applications such as Web question answering, information retrieval, and machine translation. Sentiment orientation information extraction focuses on mining opinion information from massive unstructured Web data and then analyzing the sentiment orientation of publishers toward the information they publish. It has many applications in modern life: for example, it can help enterprises accurately capture users' evaluations of products in order to optimize marketing strategies, and it can give government departments a basis for decision-making in public opinion monitoring and emergency handling. This thesis carries out algorithm design and experimental validation aimed at the main challenges in named entity disambiguation and sentiment orientation information extraction.
The main contributions of the thesis can be summarized as follows:
(1) A Wikipedia-based named entity disambiguation method is proposed. It performs disambiguation through entity mention identification, construction of a candidate entity library, and named entity matching. Within this method, we propose a new way to compute the similarity between a mention to be disambiguated and its candidate entities, and we use Wikipedia pages to mine the semantic associations among entities and between mentions and candidate entities. The effectiveness of the method is verified on the WISE Challenge 2013 dataset.
(2) An algorithm for identifying sentiment evaluation units, based on syntactic dependency relations and SVM, is proposed. In a sentiment-bearing sentence, a sentiment evaluation unit is the pairing of a sentiment word with the evaluation target it modifies, and it directly determines the sentiment orientation of the sentence. The proposed algorithm first extracts all possible candidate sentiment evaluation units by simple pattern matching and then filters the candidate set with an SVM model. For the classification step, we propose a method that automatically builds a large-scale training set from syntactic dependency relations, which improves the efficiency of training the classification model. Experiments on real datasets show that the algorithm clearly improves on previous algorithms in both precision and recall.
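The disambiguation pipeline in contribution (1), which looks up candidates for a mention and then picks the candidate most similar to the mention's context, can be illustrated with a small sketch. This is not the thesis's similarity measure, which the abstract does not specify: plain bag-of-words cosine similarity is swapped in as a stand-in, and the `CANDIDATES` table, its descriptions, and the function names are all hypothetical. A real system would build the candidate library from Wikipedia pages.

```python
import math
import re
from collections import Counter

# Hypothetical candidate entity library: mention -> {entity title: description}.
# In the thesis this library is built from Wikipedia; here it is hand-written.
CANDIDATES = {
    "Washington": {
        "George Washington": "first president of the United States "
                             "and general of the Continental Army",
        "Washington, D.C.": "capital city of the United States "
                            "on the Potomac River",
    }
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(mention, context):
    """Return the candidate entity whose description best matches the context.

    Stand-in similarity (assumption): bag-of-words cosine, not the
    similarity computation proposed in the thesis.
    """
    candidates = CANDIDATES.get(mention, {})
    if not candidates:
        return None
    ctx = bag_of_words(context)
    return max(
        candidates,
        key=lambda title: cosine(ctx, bag_of_words(candidates[title])),
    )


if __name__ == "__main__":
    print(disambiguate(
        "Washington",
        "Washington was elected president and led the army as general",
    ))
    print(disambiguate(
        "Washington",
        "The museum is in the capital city on the river",
    ))
```

The two calls differ only in the context around the mention, which is the signal a context-similarity approach relies on. Semantic associations between entities, which the thesis also mines from Wikipedia, are not modeled here.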
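[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.1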
