Research on Web Page Structured Information Extraction for Weibo Public Opinion Analysis

Published: 2018-05-26 16:56

Topics: Weibo + public opinion; Source: Beijing University of Posts and Telecommunications, 2014 master's thesis

[Abstract]: Weibo (microblogging) is a platform for acquiring, sharing, and disseminating information based on user relationships. As one of the most popular social tools on the Internet today, Weibo brings convenience to people but is also becoming a breeding ground for the spread of false information. Monitoring public opinion on Weibo is therefore necessary for national governments and network regulators. To analyze this important source of public opinion globally and effectively, we need to collect posts from several currently popular Weibo sites at the same time and obtain structured information for each post, such as its author, body text, comment count, and repost count. To this end, this thesis proposes a unified method, based on hierarchical clustering, for extracting structured information from Weibo web pages. Without relying on service providers' APIs, the method collects structured information post by post from the pages of any Weibo service provider fetched by a web crawler, laying the foundation for cross-site, global Weibo public opinion analysis. The main work of this thesis is as follows:
1) We study the public opinion indicators analyzed by typical Weibo public opinion analysis systems, as well as their system architecture, and state the requirements that such a system places on its module for extracting structured information from Weibo web pages.
2) Building on this work, we propose a unified, hierarchical-clustering-based method for extracting structured information from Weibo web pages. The method takes full account of the DOM tree structure specific to Weibo pages and overcomes two problems of some current general-purpose Web information extraction methods: heavy computation and inaccurate extraction of the main text of Weibo pages. It extracts the structured information in Weibo pages efficiently and accurately.
3) We run extraction experiments with the proposed method on pages from several Weibo sites and try the method in an experimental Weibo public opinion analysis system. These experiments show that the proposed method is highly accurate and meets the requirements that a Weibo public opinion analysis system places on its web page structured information extraction module.
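The abstract does not give the algorithm's details, so the following is only an illustrative sketch of the general idea of grouping structurally similar DOM subtrees by hierarchical clustering, not the thesis's actual method. The tag-path signature, the Jaccard distance, the single-linkage merge rule, the 0.3 threshold, the scoring heuristic, the field order in `FIELDS`, and the sample markup are all assumptions made for this example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}  # tags with no end tag


class Node:
    """Minimal DOM node: tag name, child elements, and directly contained text."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-formed HTML (no error recovery)."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def tag_paths(node, prefix=""):
    """Set of tag paths in a subtree; used here as its structural signature."""
    path = f"{prefix}/{node.tag}"
    paths = {path}
    for child in node.children:
        paths |= tag_paths(child, path)
    return paths


def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)


def agglomerative(sigs, threshold):
    """Single-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(sigs))]
    while len(clusters) > 1:
        # Find the closest pair of clusters (x < y).
        d, x, y = min(
            (min(jaccard_distance(sigs[i], sigs[j])
                 for i in clusters[x] for j in clusters[y]), x, y)
            for x in range(len(clusters)) for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] += clusters.pop(y)
    return clusters


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def leaf_texts(node):
    texts = [node.text] if node.text else []
    for child in node.children:
        texts += leaf_texts(child)
    return texts


FIELDS = ("author", "text", "comments", "reposts")  # hypothetical field order


def extract_posts(html, threshold=0.3):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_score, best_nodes = 0, []
    for parent in walk(builder.root):
        kids = parent.children
        if len(kids) < 2:
            continue
        sigs = [tag_paths(k) for k in kids]
        for cluster in agglomerative(sigs, threshold):
            # Heuristic: favor large clusters of structurally rich siblings.
            score = (len(cluster) - 1) * len(sigs[cluster[0]])
            if score > best_score:
                best_score, best_nodes = score, [kids[i] for i in sorted(cluster)]
    return [dict(zip(FIELDS, leaf_texts(n))) for n in best_nodes]


# Made-up markup standing in for a crawled microblog page.
SAMPLE_HTML = """
<div id="feed">
  <div class="ad"><img src="x.png"></div>
  <div class="post"><a>alice</a><p>hello</p><span>3</span><span>5</span></div>
  <div class="post"><a>bob</a><p>world</p><span>0</span><span>1</span></div>
  <div class="post"><a>carol</a><p>hi</p><span>7</span><span>2</span></div>
</div>
"""
```

Calling `extract_posts(SAMPLE_HTML)` groups the three `post` blocks into one cluster, leaves the structurally different `ad` block out, and maps each post's text nodes to the assumed fields by position. A real page would need site-independent field identification, which this sketch does not attempt.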
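[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092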

[References]