
A Multi-Strategy Study of Linking Weibo Entities to Encyclopedia Entries

[Abstract]: In recent years, with the rise of Web 2.0 technology and the Internet industry, social networks have grown at an unprecedented pace. Weibo, a new kind of social networking platform that emerged from this growth, has seen its user base and the volume of data it generates increase rapidly. Web 2.0 technology has also driven the rapid development of online encyclopedias, and how to use social media and web content to construct and extend knowledge bases has become an active research topic. Within this field, the ambiguity of the entity entries to be added is a central difficulty, and entity linking is an important technique for resolving it. Addressing the short, informal, and non-standard language of Chinese microblog posts, this thesis proposes a multi-strategy approach to entity-linking disambiguation for Chinese Weibo. Linking Chinese Weibo entities to encyclopedia entries means matching the candidate named entities that appear in Weibo posts against the entries of an encyclopedia knowledge base, so that each entity mentioned in a post is accurately linked to its entry. This task belongs to named entity disambiguation (NED), a subtask of named entity recognition (NER), and is an active topic in natural language processing (NLP) that serves as an indispensable research foundation for many NLP systems. Improving the accuracy of entity-linking disambiguation for Chinese Weibo enables better construction and extension of online encyclopedia knowledge bases and contributes to the generality and performance of NLP systems. The main research content of this thesis is the evaluation task of the Conference on Natural Language Processing and Chinese Computing (NLPCC), organized by the China Computer Federation (CCF), in which the author participated. A web crawler was written to collect Weibo content and encyclopedia pages, from which an encyclopedia entity mapping table was built and an encyclopedia entry knowledge base was organized. Person-name entities are disambiguated with a topic-model-based algorithm using LDA. Chinese Weibo entities are then disambiguated by combining a matching algorithm based on the entity mapping table, a TF-IDF-based entity-sense feature disambiguation algorithm, a disambiguation algorithm based on entity-sense labels, and a disambiguation algorithm based on Fast-Newman clustering. The main contributions of this thesis are: (1) constructing and organizing the encyclopedia entry knowledge base and entity mapping table; (2) a topic-model-based person-name disambiguation algorithm; (3) a multi-level, multi-strategy entity disambiguation algorithm; (4) a Chinese Weibo entity recognition system and an encyclopedia knowledge base program, for which a software copyright was filed. The data come from the Chinese Weibo entity linking tasks of the second and third NLPCC conferences (NLPCC 2013 and 2014). In the 2013 evaluation the knowledge base contained 44,492 entities and 1,274 entities were to be linked; in the 2014 evaluation the knowledge base contained 378,207 entities and 607 entities were to be linked. The 2013 accuracy was 84.99%, ranking 6th and 7th among the 18 results submitted nationwide, with the team ranking 3rd; the 2014 accuracy was 84.02%, with the team again ranking 3rd. After subsequent analysis and improvement, the models and algorithms of this thesis reach an accuracy of 91.40%.
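To make the TF-IDF sense-feature strategy mentioned in the abstract concrete, the sketch below ranks candidate encyclopedia entries for an ambiguous Weibo mention by cosine similarity between the post's context and each candidate entry's description. This is a minimal illustration under stated assumptions, not the thesis implementation: the candidate texts, the example post, and the `rank_candidates` helper are hypothetical, and character n-grams stand in for the Chinese word segmentation the original system would use.

```python
# Minimal sketch (not the thesis implementation) of TF-IDF sense-feature
# disambiguation: rank candidate encyclopedia entries for a Weibo mention
# by cosine similarity between the post context and each entry description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_candidates(weibo_context: str, candidate_entries: dict) -> list:
    """Return (entry name, similarity) pairs sorted by TF-IDF cosine similarity."""
    names = list(candidate_entries)
    corpus = [weibo_context] + [candidate_entries[n] for n in names]
    # Character n-grams avoid the need for a Chinese word segmenter in this sketch.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(corpus)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(zip(names, scores), key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical ambiguous mention "苹果" (Apple) with two candidate senses.
    candidates = {
        "苹果 (公司)": "苹果公司是一家设计和销售消费电子产品与软件的科技企业。",
        "苹果 (水果)": "苹果是一种常见的蔷薇科水果,富含维生素。",
    }
    post = "昨天去店里看了新发布的苹果手机,价格有点高。"
    for name, score in rank_candidates(post, candidates):
        print(f"{name}: {score:.3f}")
```

In the multi-strategy setting described above, a ranking of this kind would be only one signal; exact matches from the entity mapping table, entity-sense labels, and clustering-based evidence would be combined with it before a final link is chosen.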
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year conferred]: 2015
[CLC number]: TP391.1;TP393.092
