Research on Key Technologies of a Chinese Person-Name Search Engine

Published: 2018-04-23 08:29

Topics: search engine + Chinese person-name search; Source: Hebei University, master's thesis, 2012

[Abstract]: Person-name ambiguity is the identity uncertainty that arises because, in the real world, one name may be shared by several distinct people. Searching for Chinese person names is an everyday need of Internet users. As the Internet has grown, people who share a name on Web pages have made reading and comprehension increasingly difficult, and the problem is especially harmful for search engines. Today's popular general-purpose search engines handle ambiguous names only through keyword matching and ranking by page popularity, and they output a long, unorganized list. The truly valuable information is only the "tip of the iceberg" of the massive Web data, and pages about celebrities drown out pages about non-celebrities, which makes it very inconvenient for users to find information on the person they are looking for. This thesis studies Chinese person-name search, and its main work is as follows. First, building on a study of vertical search engine technology and taking into account the characteristics of Chinese person-name search, it designs an architecture for a Chinese person-name search engine. Within this architecture, the focused crawler for Web person names uses two methods, one template-based and one based on analysis of the page DOM tree, to collect person information from Baidu Baike person entries and to collect Web pages containing ambiguous names from the Internet, respectively, thereby building a person knowledge base and a library of Web pages awaiting disambiguation. For Web person-name disambiguation, the thesis presents an unsupervised, automatic disambiguation method based on Baidu Baike. Using the large volume of Baidu Baike person data as the underlying person database, it parses the rich person information and semantic relations found there, extracts three features (person background knowledge, person characteristic context, and person group information), and fuses them linearly. The entity with the highest fused score is taken as the person the ambiguous name refers to, and this assignment serves as the basis for building the Web page index. Finally, the thesis builds an experimental prototype and runs Web Chinese person-name disambiguation experiments; the method achieves good disambiguation results, which supports its effectiveness.
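The selection step described in the abstract (linearly fuse three feature scores, then pick the candidate person with the highest score) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how each feature is scored or how the weights are chosen, so the Jaccard overlap, the weight values, and all names here (`Candidate`, `fused_score`, `disambiguate`) are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Sequence, Set, Tuple


@dataclass
class Candidate:
    """One real-world person bearing the ambiguous name (hypothetical structure)."""
    entity_id: str
    background: Set[str] = field(default_factory=set)  # background-knowledge terms
    context: Set[str] = field(default_factory=set)     # characteristic context terms
    group: Set[str] = field(default_factory=set)       # names of associated people


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; an assumed stand-in for the per-feature similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def fused_score(page_terms: Set[str], page_people: Set[str], cand: Candidate,
                weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Linear fusion of the three feature similarities (weights are assumed)."""
    w_bg, w_ctx, w_grp = weights
    return (w_bg * jaccard(page_terms, cand.background)
            + w_ctx * jaccard(page_terms, cand.context)
            + w_grp * jaccard(page_people, cand.group))


def disambiguate(page_terms: Set[str], page_people: Set[str],
                 candidates: Sequence[Candidate],
                 weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> Candidate:
    """Return the candidate with the highest fused score.

    Raises ValueError if `candidates` is empty.
    """
    return max(candidates,
               key=lambda c: fused_score(page_terms, page_people, c, weights))


# Toy data: two people who share one name.
candidates = [
    Candidate("professor", {"university", "physics", "professor"},
              {"lecture", "physics"}, {"Li Si"}),
    Candidate("athlete", {"basketball", "team"}, {"match", "coach"}, {"Wang Wu"}),
]
best = disambiguate({"university", "professor", "physics"}, {"Li Si"}, candidates)
print(best.entity_id)
```

Because the method is unsupervised, a sketch like this needs no labeled pages; in practice the feature sets would come from the person knowledge base and from terms extracted from the page being disambiguated.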
[Degree-granting institution]: Hebei University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3

[References]