Research on Related Entity Finding on the Web

Published: 2018-05-02 11:24

Topic: related entity finding + type refinement; source: Beijing Jiaotong University, 2013 doctoral dissertation

[Abstract]: With the rapid development of the Internet and of information retrieval technology, the Web has become an important channel through which people obtain information, and search engines have become an important tool for retrieving it. In traditional search, a user submits a query to a search engine (such as Google or Baidu), and the engine returns a list of relevant documents. Often, however, what the user needs is not the documents themselves but the entity information they contain. How to find the entities a user needs among a large number of Web documents has therefore become a research hotspot in recent years, and research on related entity finding arose in response to this kind of entity-oriented query. Related entity finding means: given a query consisting of a source entity, a target type, and a description of the relation between the source entity and the target entities, find a set of entities that satisfy the query.
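To make the three-part query concrete, here is a minimal sketch of how such a query could be represented. The class name, field names, and sample values are illustrative assumptions and are not taken from the thesis; the type check is a plain string comparison standing in for whatever type judgment a real system performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelatedEntityQuery:
    """Hypothetical container for a related entity finding query."""
    source_entity: str  # the entity the query starts from
    target_type: str    # coarse type every answer must have
    relation: str       # free-text description of the relation


def filter_by_type(query, candidates):
    """Keep candidate names whose (coarse) type equals the target type.

    `candidates` maps an entity name to its type label.
    """
    return [name for name, etype in candidates.items()
            if etype == query.target_type]


# Illustrative values only.
query = RelatedEntityQuery(
    source_entity="Blackberry",
    target_type="organization",
    relation="Carriers that Blackberry makes phones for",
)
candidates = {"Verizon": "organization", "Waterloo": "location"}
kept = filter_by_type(query, candidates)
```

A coarse label such as "organization" admits many entities that are not carriers, which is the gap the type refinement work described in the abstract is meant to close.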
The returned entities must match the type required by the query, but the given target type is often very coarse, which makes it impossible to judge the type of a retrieved entity accurately. To address this problem, we did the following work:
1) We propose a method for automatically obtaining the fine-grained target type and its hyponym seed entities. The fine-grained target type is obtained by syntactic analysis of the query sentence, and query templates are used to obtain hyponym seed entities of the target type.
2) We propose an induction-based method for obtaining the set of rules that identify hyponym categories of the fine-grained target type. When only a small number of seed entities is available, the rule set is obtained by induction.
3) We propose a feature-extraction-based method for obtaining the same kind of rule set. When a larger number of seed entities is available, the rule set is obtained with the best feature extraction method found by learning.
Because the initially retrieved candidate entities are unordered, all candidates must be ranked in order to obtain the entities that satisfy the user's query. To address this problem, we did the following work:
1) We propose an entity ranking method based on a generative probabilistic model. Entities are ranked by combining three components: entity relevance, entity type relevance, and entity relation relevance. The best ranking method is selected by comparing several ways of combining them. Entity type relevance is computed in two ways: one uses the rule sets obtained by induction, computing type relevance with different numbers of rule sets; the other uses the rule sets obtained by feature extraction. For entity relation relevance, we evaluate the effect of two smoothing methods on entity ranking and propose a computation that reconstructs the relation after removing stop words, which improves ranking quality and reduces running time.
2) We propose an entity ranking method based on Markov random fields. An entity is represented by three attributes: document, type, and name. Using the best learned weight parameters, the method ranks entities by linearly combining the relevance between the query and the candidate entity's representation document, the relevance between the target type and the candidate entity's type, and the relevance between the source entity and the candidate entity's name.
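The two ranking ideas above can be sketched roughly as follows. This is not the thesis's implementation: the abstract does not name its two smoothing methods, so Jelinek-Mercer smoothing is an assumed choice here; the stop-word list, function names, sample texts, and scores are all hypothetical; and the Markov random field model is reduced to its final weighted linear combination rather than modeled as a graph.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the thesis does not say which list it uses.
STOP_WORDS = {"a", "an", "and", "for", "is", "of", "one", "that", "the", "to"}


def content_terms(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def relation_log_likelihood(relation, entity_text, collection_text, lam=0.5):
    """log P(relation | entity) with Jelinek-Mercer smoothing (assumed)."""
    ent = Counter(content_terms(entity_text))
    col = Counter(content_terms(collection_text))
    ent_len, col_len = sum(ent.values()), sum(col.values())
    total = 0.0
    for term in content_terms(relation):
        p_ent = ent[term] / ent_len if ent_len else 0.0
        p_col = col[term] / col_len if col_len else 0.0
        p = (1 - lam) * p_ent + lam * p_col
        total += math.log(max(p, 1e-12))  # floor avoids log(0)
    return total


def generative_score(entity_rel, type_rel, relation_ll):
    """Product of the three components, in log space (inputs must be > 0)."""
    return math.log(entity_rel) + math.log(type_rel) + relation_ll


def linear_score(doc_rel, type_rel, name_rel, weights):
    """Weighted linear combination of three relevance scores."""
    w_doc, w_type, w_name = weights
    return w_doc * doc_rel + w_type * type_rel + w_name * name_rel


# Toy data, for illustration only.
relation = "Carriers that Blackberry makes phones for"
collection = ("verizon carriers blackberry makes phones "
              "apple makes phones computers")
candidates = {
    "Verizon": {"text": "verizon is one of the carriers that blackberry "
                        "makes phones for",
                "entity_rel": 0.6, "type_rel": 0.9},
    "Apple": {"text": "apple makes phones and computers",
              "entity_rel": 0.4, "type_rel": 0.9},
}

ranked = sorted(
    candidates,
    key=lambda name: generative_score(
        candidates[name]["entity_rel"],
        candidates[name]["type_rel"],
        relation_log_likelihood(relation, candidates[name]["text"],
                                collection),
    ),
    reverse=True,
)
```

Dropping stop words before scoring shortens the relation to its content terms, so fewer probabilities are computed per candidate, which is one plausible reading of the reported reduction in running time.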
In the related entity finding task, an entity is defined as being represented by its unique homepage, so after all candidate entities are ranked, each entity's homepage must still be found. For this homepage finding problem, we propose a method that combines the score of a Web page's multi-attribute representation with the score derived from the external links on the entity's Wikipedia page.
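One way to picture this combination is the sketch below. It assumes, purely for illustration, that a page is represented by per-field relevance scores (title, URL, body), that the Wikipedia signal is a binary "appears among the external links" indicator, and that the two are mixed linearly. None of the field names, weights, or the mixing rule come from the thesis.

```python
def page_score(field_scores, field_weights):
    """Weighted sum of a page's per-field relevance scores."""
    return sum(weight * field_scores.get(field, 0.0)
               for field, weight in field_weights.items())


def find_homepage(pages, wiki_external_links, field_weights, link_weight=0.5):
    """Return the candidate URL with the highest combined score.

    pages: maps a URL to its per-field scores.
    wiki_external_links: URLs listed as external links on the entity's
        Wikipedia page.
    link_weight: assumed mixing weight between the two signals.
    """
    def combined(url):
        link_score = 1.0 if url in wiki_external_links else 0.0
        return ((1 - link_weight) * page_score(pages[url], field_weights)
                + link_weight * link_score)

    return max(pages, key=combined)


# Toy data, for illustration only.
field_weights = {"title": 0.4, "url": 0.4, "body": 0.2}
pages = {
    "https://example.com/": {"title": 0.9, "url": 0.8, "body": 0.4},
    "https://example.org/news": {"title": 0.6, "url": 0.1, "body": 0.9},
}
best = find_homepage(pages, {"https://example.com/"}, field_weights)
```

With these toy numbers the first page wins on both signals. Moving the external link to the second page is enough to flip the result, which shows how strongly the link evidence can weigh under this assumed mixing rule.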
Experimental results show that the proposed methods complete the related entity finding task effectively, greatly reduce the manual effort users spend collecting related entity information, and provide users with useful results.

[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Doctoral
[Year degree granted]: 2013
[Classification number]: TP391.3

[Co-cited literature]