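Research on Methods for Entity Search and Entity Resolution
Published: 2018-06-23 22:37
Topics: entity search + entity resolution; Source: Lanzhou University, 2012 doctoral dissertation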
[Abstract]: Quickly and accurately finding entities (such as person names, organizations, products, and drugs) and their related information in unstructured or semi-structured data has become key to many applications, including information retrieval, recommender systems, and social networks. Recent research shows that entity-related searches account for a large share of Internet queries, and that this share keeps rising. Compared with single characters or fixed-length phrases, entities describe the semantic features of a text more accurately and so help users grasp its core content quickly. However, as Internet data keeps growing, information retrieval becomes increasingly difficult, and the non-uniqueness (ambiguity) of entities in particular has become a pervasive problem. First, many different entities share exactly the same name: more than 290,000 people in China are named "Zhang Wei", for example, and when an entity name is entered in a query box, the top 100 web pages returned by a search engine often refer to several different objects that share that name. Second, the same entity often appears in different data sources under multiple forms (aliases): the People's Republic of China is often called "China" or "P.R.C", and Liu Xiang has been hailed as the "Asian Flying Man". The pharmaceutical industry also suffers badly from the "one drug, many names" and "one name, many drugs" problems, and the non-unique matching of drug names is a major obstacle to correct medication use. These two problems are entity name ambiguity and entity alias identification. Solving them are opposite yet closely related processes, and they are the two most important problems in entity search and resolution.
This dissertation surveys work on entity search extensively and analyzes the characteristics of data from different sources, including the surface web, social networks, and enterprise intranets. It proposes effective solutions for the entity name ambiguity problem and the entity alias problem respectively. In addition, based on the proposed name disambiguation solution, we developed a people search system, and we extend the proposed alias discovery solution so that it applies to dynamic data environments. In these studies we focus on analyzing unstructured text, making full use of natural language processing methods to explore the structural and content features of the words, entities, and sentences in a text, and using data mining algorithms to link this information in order to solve the problems encountered in entity search and entity resolution. The main contributions of this dissertation are as follows:
1. Entity search survey. We introduce the problems encountered in entity search and the techniques adopted, and briefly describe existing person-name search systems, problems related to person-name search, and future research directions.
2. Entity name disambiguation. Taking person name disambiguation as the example, we use natural language processing tools to extract named entities from the unstructured documents returned by a search engine, use the extracted entities as person labels, build a graph structure based on these entity labels, and finally assign entity labels to the different people who share the same name so that each is uniquely described. In addition, the person-name search system we developed submits a given person name as the query to an existing search engine (Google, Yahoo, or Bing) and applies our disambiguation method to the returned results, so that users can clearly see the key entity information for the different people who bear the queried name.
3. Entity alias discovery. We study separately the two cases in which an entity and its alias do or do not share string similarity. For the first case, we first extract alias candidates based on character similarity and then build an entity-relationship graph to select aliases. The case in which an alias has essentially no character similarity to the original entity is more challenging: we propose a method based on entity subset partitioning to filter alias candidates, and then determine the final aliases of a given entity with an active-learning classification method. Overall, the alias discovery approach explores the relationships among entities in a given dataset, designs an initial filtering method to extract alias candidates for a given entity, then uses unsupervised or supervised methods to examine the correlation between the entity and its alias candidates, and finally outputs an alias list for each given entity.
4. Dynamic entity alias discovery. As new data is added to a given dataset, the entity-relationship graph built on that dataset must be updated accordingly (insertion, deletion, and modification of vertices and edges), and earlier static solutions no longer suit such a dynamic environment. We therefore propose a path search method based on an entity index to update the dynamic graph, and apply this dynamic scheme to the incremental entity alias discovery problem.
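As a rough illustration of the first alias-discovery case in contribution 3 (aliases that share string similarity with the entity), the initial candidate-filtering step might be sketched as follows. This is not the dissertation's implementation: the abstract does not name its similarity measure, so Python's standard difflib.SequenceMatcher is used here as a stand-in, and the function names, the 0.6 threshold, and the sample mentions are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def alias_candidates(entity, mentions, threshold=0.6):
    """Return (mention, score) pairs at or above the threshold, best first.

    The threshold is an arbitrary illustrative value, not one from the source.
    """
    scored = [
        (m, similarity(entity, m))
        for m in mentions
        if m.lower() != entity.lower()  # skip the entity's own name
    ]
    kept = [(m, s) for m, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical mentions collected from a document set.
mentions = [
    "Intl. Business Machines",
    "International Business Machine",
    "Big Blue",
    "IBM Corp.",
]

for name, score in alias_candidates("International Business Machines", mentions):
    print(f"{name}: {score:.2f}")
```

In this sketch "Big Blue" and "IBM Corp." fall below the threshold even though they refer to the same company, which is the situation the second case in contribution 3 addresses with subset partitioning and active learning rather than string matching.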
[Degree-granting institution]: Lanzhou University
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP391.3
Article ID: 2058714
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2058714.html