Research on Data Collection Techniques in Weibo Public Opinion Systems

Published: 2018-04-18 22:10

Topic: Weibo data + simulated login; Reference: 2014 master's thesis, Xiangtan University

[Abstract]: As the Internet has matured and the mobile Internet has developed rapidly, more and more information is published online, and the public has gradually come to accept this way of sharing it. Online information reflects public sentiment to some extent, but inflammatory statements can also incite Internet users, so online public opinion is receiving growing attention. To foster a healthy online environment, the relevant government departments need to predict, detect, and guide online public opinion effectively. Within this field, public opinion on Weibo draws particular attention, because a growing number of incidents are first exposed on Weibo and then spread and discussed there until they become public opinion events. The fact that governments at all levels, enterprises, and public institutions have opened Weibo accounts shows how prominent Weibo has become online. This thesis analyzes several problems in data collection for Weibo public opinion systems and proposes collecting web pages through simulated login, supplemented by a priority queue so that the more influential posts on Weibo are collected first. The main work is as follows:

(1) Three commonly used collection methods are analyzed: Weibo push, the Weibo API, and web crawlers. The first two struggle to meet a public opinion system's requirements for the scale and timeliness of Weibo data, while the third does not easily collect useful information. The thesis therefore proposes simulating a browser login to Weibo and crawling the resulting web pages, which makes it easy to obtain the data on any Weibo user's page and avoids the collection-rate limits of the first two methods.

(2) Because Weibo has an enormous number of users, a collector can miss many of them. The thesis addresses this by building a Weibo user network. Each user is abstracted as a node; the follower, following, repost, and comment relationships between users are abstracted as edges; and the quantified value of each relationship serves as the weight of that relationship on the edge. Adding nodes and edges in this way yields a very large user network through which new users are continually discovered, which in turn keeps the collected data complete.

(3) To collect Weibo data efficiently, the thesis uses a priority queue algorithm. Efficient collection here means collecting in tiers when facing large volumes of data: posts by highly influential users are collected first, followed by posts from less influential users. To support this, a priority calculation model is designed. Drawing on Sina Weibo's definition of influential users and on practical considerations, five factors are selected: follower count, following count, activity, propagation power, and timestamp. The priority queue is built with influence as the main factor, so that more influential users are collected more frequently, while a computed time interval ensures that data from inactive users is still collected.

In addition, because Weibo pages share a uniform structure, the thesis designs corresponding denoising and parsing methods for the fetched pages. These locate the useful information directly through fixed feature markers, which makes parsing efficient. A simple analysis of the collected data yields some simple but interesting findings. The experimental results indicate that the method is broadly applicable, requires no manual intervention, and retrieves high-quality information quickly.
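The user-network idea in the abstract (users as nodes, follower/following/repost/comment relationships as weighted edges, new users discovered by walking the network) might look roughly like the sketch below. The abstract does not give the thesis's data structures or weighting scheme, so the class name `UserNetwork`, the relation labels, the additive weights, and the breadth-first traversal are all assumptions made for illustration.

```python
from collections import defaultdict, deque

# Relationship types named in the abstract; the string labels are illustrative.
RELATIONS = ("follower", "following", "repost", "comment")


class UserNetwork:
    """Directed user graph that keeps one weight per relationship type per edge."""

    def __init__(self):
        # edges[src][dst] maps a relationship type to its accumulated weight.
        self.edges = defaultdict(dict)
        self.users = set()

    def add_relation(self, src, dst, relation, weight=1.0):
        """Add (or strengthen) one typed relationship from src to dst."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.users.update((src, dst))
        per_edge = self.edges[src].setdefault(dst, {})
        # Assumption: repeated observations of a relationship add up.
        per_edge[relation] = per_edge.get(relation, 0.0) + weight

    def neighbors(self, user):
        """Users directly reachable from `user`, in sorted order."""
        return sorted(self.edges.get(user, {}))

    def discover(self, seeds):
        """Breadth-first walk from seed users; returns users in the order first reached."""
        seen = set(seeds)
        order = list(seeds)
        queue = deque(seeds)
        while queue:
            current = queue.popleft()
            for nxt in self.neighbors(current):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


net = UserNetwork()
net.add_relation("alice", "bob", "following")
net.add_relation("bob", "carol", "repost", weight=3)
net.add_relation("bob", "carol", "repost", weight=2)  # weights accumulate on the edge
net.add_relation("dave", "alice", "comment")
# dave only has an edge pointing at alice, so a walk from alice does not reach him.
print(net.discover(["alice"]))
```

In a real collector the edges would be filled in from each crawled page, so every visited user can contribute previously unseen users to the frontier.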
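The priority-queue scheduling described in the abstract can be illustrated with a small sketch. The abstract names the five factors (follower count, following count, activity, propagation power, timestamp) but not the formula, so everything numeric here is hypothetical: the weights, the log scaling, the interval bounds, and the reading of "timestamp" as the time of the last crawl. The sketch only shows the general shape: a higher influence score gives a shorter revisit interval, and an upper bound on the interval means inactive users are still revisited.

```python
import heapq
import itertools
import math
from dataclasses import dataclass


@dataclass
class UserStats:
    user_id: str
    followers: int
    following: int
    activity: float      # assumed unit: posts per day
    propagation: float   # assumed unit: mean reposts and comments per post
    last_crawled: float  # assumed meaning of the timestamp factor, in seconds


# Hypothetical weights; the thesis's actual priority model is not given in the abstract.
WEIGHTS = {"followers": 0.4, "following": 0.1, "activity": 0.2, "propagation": 0.3}
MIN_INTERVAL = 60.0          # shortest allowed gap between visits, in seconds
MAX_INTERVAL = 24 * 3600.0   # longest allowed gap, so low-influence users are not starved


def influence(u: UserStats) -> float:
    """Weighted sum of the four log-scaled non-time factors."""
    return (WEIGHTS["followers"] * math.log1p(u.followers)
            + WEIGHTS["following"] * math.log1p(u.following)
            + WEIGHTS["activity"] * math.log1p(u.activity)
            + WEIGHTS["propagation"] * math.log1p(u.propagation))


def next_due(u: UserStats) -> float:
    """Time at which the user should next be crawled; shrinks as influence grows."""
    interval = MAX_INTERVAL / (1.0 + influence(u))
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return u.last_crawled + interval


class CrawlScheduler:
    """Min-heap keyed on the next due time."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so UserStats objects are never compared

    def push(self, u: UserStats) -> None:
        heapq.heappush(self._heap, (next_due(u), next(self._tie), u))

    def pop_due(self, now: float):
        """Return the user with the earliest due time if it has passed, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[-1]
        return None
```

After a user is crawled, the caller would set `last_crawled` to the current time, refresh the other statistics from the page, and `push` the user again, so influential accounts cycle through the queue more often than quiet ones.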
[Degree-granting institution]: Xiangtan University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092
