Research on a Person Name Disambiguation Algorithm Based on Two-Stage Clustering

Published: 2018-02-27 19:17

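Keywords: person name disambiguation; attribute extraction; semantic relation graph; clustering. Source: Northeastern University (东北大学), master's thesis, 2012. Document type: degree thesis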

[Abstract]: With the spread of the Internet, submitting queries to search engines has become the main way people obtain information from the web. Person name search is among the most common kinds of query, and a search engine makes it easy to find information about a person. Because many people share the same name, however, a query for a single name often returns a long result list that mixes many namesakes. To find a specific person, the user must either refine the query with additional features or browse the result list and pick out the intended person from among the namesakes, which greatly reduces search efficiency. An effective person name disambiguation algorithm is therefore needed to improve the efficiency of person name search. Building on an analysis of existing theory and techniques for person name disambiguation, this thesis proposes a two-stage clustering method. Person attributes are important features for name disambiguation. First, the thesis extracts 16 main person attributes: the 9 that are relatively easy to extract are obtained with conventional regular-expression patterns and dictionary matching, while the 7 that are harder to extract are obtained with an automatic extraction method based on self-expansion. Next, each result document returned by the search engine is represented as an attribute vector, and the similarity between documents is computed. Finally, an initial clustering is performed. Because not every web page contains person attribute information, many pages without such information cannot be clustered correctly in this initial stage. The thesis therefore proposes a second clustering pass that uses semantic relations. First, concepts and the semantic relations between them are extracted from Wikipedia, the relations are quantified, and a semantic relation graph is built. Second, the SimRank algorithm is used to compute the similarity between any two nodes in the graph. Third, each cluster from the initial stage is represented as a vector of Wikipedia concepts. Finally, the similarity between clusters is computed from the semantic relations between concepts, and the second-stage name clustering is carried out. Experimental results show that the proposed two-stage clustering algorithm significantly improves both precision and recall and outperforms previous methods, which indicates that it is effective for the person name disambiguation problem.
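The abstract names SimRank as the node-similarity measure for the second stage but gives no details. As a rough illustration only, the sketch below implements the basic unweighted SimRank iteration in plain Python. The function name `simrank`, the decay factor 0.8, the iteration count, and the toy concept graph are all assumptions made for this example; the thesis also quantifies the semantic relations, which this unweighted sketch does not model.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic (unweighted) SimRank.

    in_neighbors maps each node to the list of nodes that link to it.
    Every node that appears as a neighbor must also be a key.
    Returns a dict mapping (a, b) pairs to similarity scores.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0
           for a, b in product(nodes, repeat=2)}

    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            in_a, in_b = in_neighbors[a], in_neighbors[b]
            if not in_a or not in_b:
                # A node with no in-links is similar to no other node.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors,
            # scaled by the decay factor.
            total = sum(sim[(x, y)] for x in in_a for y in in_b)
            new_sim[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = new_sim
    return sim


# Hypothetical toy concept graph: "sport" links to both other concepts.
graph = {
    "sport": [],
    "basketball": ["sport"],
    "football": ["sport"],
}
scores = simrank(graph)
print(scores[("basketball", "football")])
```

In this toy graph, "basketball" and "football" share their only in-neighbor, so their score equals the decay factor, while "sport" has no in-links and scores zero against every other node. In a two-stage pipeline like the one described, such pairwise concept scores would feed the cluster-to-cluster similarity used in the second clustering pass.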
[Degree-granting institution]: Northeastern University (东北大学)
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1

[References]