Research on Entity Linking Methods for Chinese Weibo

Published: 2018-07-07 19:35

  Topics: Weibo + entity linking; Source: Harbin Institute of Technology, master's thesis, 2013


[Abstract]: As online resources keep expanding and the volume of information keeps growing, it is becoming harder for people to find valuable information. The rise and popularity of Weibo makes it harder still to get the content of interest out of short texts. To address this, the research group developed a knowledge expansion and recommendation platform that supplies additional background on the information a user is interested in. The ambiguity of the knowledge items to be expanded, however, became the bottleneck of system performance. Entity linking is an important way to solve this problem: it lets a program automatically determine which real-world entity an entity mention in a given context refers to, and thereby resolves the ambiguity. For the entity linking task on Chinese Weibo, a short-text domain, this thesis carries out the following work.

To obtain a sufficient Weibo corpus, a crawler for the Weibo web pages was implemented first. Compared with retrieval through the API, it greatly improves collection efficiency. A large Weibo corpus was collected and preprocessed.

Candidate entity generation is the key step of entity linking. For each mention to be disambiguated, candidate entities are obtained through several different channels, and each channel is assigned its own weight in order to suppress noise and improve disambiguation accuracy. Candidate knowledge base information comes mainly from Wikipedia and Baidu Baike; for terms that appear in neither encyclopedia, a meta-search engine is called to aggregate information from the web. To address the feature sparsity of Weibo text, a post is first expanded with the user's profile, tags, and recent posts, and then with the results returned by search engines such as Google, Baidu, and Bing for keywords extracted from the post.

Two entity linking algorithms were implemented: one based on multi-channel candidate entities and one based on a domain lexicon. In a comparison of the methods, the algorithms achieve reasonably good accuracy on the public NLPCC 2013 evaluation data set. Finally, an application system for knowledge expansion and recommendation was built on the Sina Weibo open platform. Running the algorithms inside this system shows that the expected results can be achieved.
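The abstract does not give the details of the multi-channel algorithm, so the following is only a minimal sketch of the general idea, under several assumptions: candidates arrive from named channels, each channel carries a fixed weight, and a candidate is scored by its channel weight multiplied by the bag-of-words cosine similarity between the post context and the candidate's knowledge base description. The channel names, the weight values, and the functions `merge_candidates` and `link_entity` are all hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import sqrt

# Hypothetical channel weights; the thesis abstract does not state its values.
CHANNEL_WEIGHTS = {"encyclopedia": 1.0, "redirect": 0.8, "search": 0.5}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_candidates(channels):
    """Merge candidates from all channels, keeping each entity's highest channel weight."""
    prior = {}
    for channel, entities in channels.items():
        weight = CHANNEL_WEIGHTS.get(channel, 0.0)  # unknown channels get weight 0
        for entity in entities:
            prior[entity] = max(prior.get(entity, 0.0), weight)
    return prior


def link_entity(context_tokens, channels, descriptions):
    """Return (best_entity, score); (None, 0.0) if nothing scores above zero."""
    context = Counter(context_tokens)
    best, best_score = None, 0.0
    for entity, weight in merge_candidates(channels).items():
        description = Counter(descriptions.get(entity, []))
        score = weight * cosine(context, description)
        if score > best_score:
            best, best_score = entity, score
    return best, best_score


if __name__ == "__main__":
    # Toy, already-tokenized data for the ambiguous mention "apple".
    channels = {
        "encyclopedia": ["Apple Inc.", "Apple (fruit)"],
        "search": ["Apple Records"],
    }
    descriptions = {
        "Apple Inc.": ["company", "phone", "computer", "released"],
        "Apple (fruit)": ["fruit", "tree", "sweet"],
        "Apple Records": ["record", "label", "music"],
    }
    print(link_entity(["apple", "released", "new", "phone"], channels, descriptions))
```

In this sketch the expanded context (user profile, tags, recent posts, search results) would simply be appended to `context_tokens` before scoring. A real system for Chinese text would also need word segmentation and would probably use TF-IDF or learned features rather than raw counts.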
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP393.092



Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2106036.html

