Research and Implementation of Microblog Web Crawler Technology
Published: 2018-05-19 20:29
Topics: web crawler + XPath extraction; Source: master's thesis, Jilin University, 2013
[Abstract]: With the continued development of mobile communication networks and Web 2.0 technology, microblogging (Weibo) has become a basic tool for people's everyday communication and entertainment, and more and more people use it to spread advertisements, news, topics, and other information. At the same time, because of its openness and anonymity, Weibo also harbors a great deal of harmful content, such as rumors, violent material, and subversive information, which greatly complicates the guidance and supervision of public opinion in China. Research on data collection from microblog networks is therefore both a foundation for modeling and optimizing information dissemination in these networks and a necessary prerequisite for monitoring and analyzing microblog public opinion, giving it significant research and practical value.

Taking Sina Weibo as its research object, and building on a survey of current mainstream crawler technology, this thesis designs and implements an efficient incremental microblog crawler. The main work is as follows:

1. Based on the requirements of information extraction, the thesis analyzes the structure of Sina Weibo information and collects users' basic profiles, their tags and followed topics, their social relations (followees and followers), and their published posts, designing a database schema for the fields to be extracted. For collection, the crawler visits user home pages by simulating a browser, converts the downloaded page source into a Document Object Model (DOM) tree, and extracts structured information from the DOM using XPath expressions. Data storage follows software-engineering practice: the persistence layer is built on Hibernate and Spring, which hides the details of data access and storage.

2. In the concrete design, the thesis implements automatic form filling: a packet-capture tool is used to analyze the encryption protocol of the Sina Weibo login, the crawler fills in and submits the login form as a simulated browser, obtains the cookies returned by the Sina Weibo server, and uses those cookies to download users' pages. To collect user information efficiently and continuously, the thesis designs and implements a crawler for page collection and storage based on a multi-producer/multi-consumer model: the collection side acts as the producers, continuously downloading pages from the Sina Weibo server and parsing them into structured data, while the storage side acts as the consumers, with each kind of structured record stored by its own thread. To further improve crawler efficiency, the Sina Weibo API is used as an auxiliary channel for collecting users' social information.

3. The thesis studies the microblog crawl-scheduling problem in depth. Because users publish posts at very different rates, polling all users indiscriminately wastes a great deal of bandwidth and network resources. The thesis therefore proposes a crawl-scheduling strategy based on user activity: the collected timestamps of users' posts are used to predict activity, with time-series analysis forecasting how many posts a user will publish in the next interval; the more posts a user is expected to publish, the higher that user's activity, and the more frequently the crawler visits that user. Experimental results show that, compared with simple depth-first crawling, this collection strategy clearly improves both coverage and timeliness.
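The DOM-plus-XPath extraction step described in the first part of the work can be sketched as follows. This is a minimal illustration, not the thesis code: a real crawler would first repair raw Weibo HTML into a well-formed tree, and the element and field names here (`user`, `info`, `tags`, `post`) are hypothetical stand-ins for the actual page structure.

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for a fetched profile page that has already been
# converted into a well-formed DOM tree. Real Weibo pages would need an
# HTML parser first; the tag names below are invented for illustration.
PAGE = """
<user>
  <info><name>alice</name><location>Changchun</location></info>
  <tags><tag>music</tag><tag>sports</tag></tags>
  <posts>
    <post time="2013-01-02">hello</post>
    <post time="2013-01-05">world</post>
  </posts>
</user>
"""

def extract_profile(xml_text):
    """Extract structured fields from the DOM with XPath-style expressions
    (ElementTree supports a limited XPath subset)."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("./info/name"),
        "tags": [t.text for t in root.findall("./tags/tag")],
        "post_times": [p.get("time") for p in root.findall(".//post")],
    }

profile = extract_profile(PAGE)
print(profile["name"])   # alice
print(profile["tags"])   # ['music', 'sports']
```

The extracted dictionary corresponds to one row per table in the thesis's database design; persisting it through a Hibernate/Spring-style data-access layer is then a separate concern, as the abstract notes.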
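The multi-producer/multi-consumer pipeline from the second part can be sketched with Python's thread-safe queues. The "download" and "store" steps here are placeholders (the real crawler fetches pages with login cookies and persists records through Hibernate/Spring), and the sentinel-based shutdown is just one common way to drain such a pipeline.

```python
import queue
import threading

task_q = queue.Queue()    # user IDs waiting to be crawled
store_q = queue.Queue()   # structured records waiting to be persisted
stored = []               # stand-in for the database
lock = threading.Lock()

def producer():
    """Collection side: 'download' a page per user ID and parse it."""
    while True:
        uid = task_q.get()
        if uid is None:          # sentinel: shut this producer down
            break
        record = {"uid": uid, "posts": uid * 2}   # pretend parse result
        store_q.put(record)

def consumer():
    """Storage side: persist each structured record."""
    while True:
        record = store_q.get()
        if record is None:       # sentinel: shut this consumer down
            break
        with lock:
            stored.append(record)

producers = [threading.Thread(target=producer) for _ in range(3)]
consumers = [threading.Thread(target=consumer) for _ in range(2)]
for t in producers + consumers:
    t.start()

for uid in range(10):            # enqueue crawl tasks
    task_q.put(uid)
for _ in producers:              # one sentinel per producer
    task_q.put(None)
for t in producers:
    t.join()
for _ in consumers:              # producers done; one sentinel per consumer
    store_q.put(None)
for t in consumers:
    t.join()

print(len(stored))   # 10
```

Decoupling the two sides through a bounded queue is what lets slow storage (or slow downloads) be absorbed without stalling the other side, which is the point of the producer/consumer design the abstract describes.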
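The activity-based scheduling idea from the third part can be illustrated with a simple moving-average forecast standing in for the thesis's time-series model; the user IDs, post counts, and delay formula below are all invented for the sketch.

```python
from heapq import heappush, heappop

def predict_activity(post_counts, window=3):
    """Moving-average forecast of next-interval post volume (a stand-in
    for the thesis's time-series analysis)."""
    recent = post_counts[-window:]
    return sum(recent) / len(recent)

def next_crawl_delay(activity, base=60.0, min_delay=5.0):
    """More active users get shorter revisit intervals."""
    return max(min_delay, base / (1.0 + activity))

# Hypothetical per-user post counts over the last three intervals.
history = {"u1": [0, 1, 0], "u2": [5, 8, 7], "u3": [2, 2, 3]}

schedule = []   # min-heap of (next-visit delay, user)
for user, counts in history.items():
    activity = predict_activity(counts)
    heappush(schedule, (next_crawl_delay(activity), user))

order = [heappop(schedule)[1] for _ in range(len(schedule))]
print(order)   # ['u2', 'u3', 'u1'] -- most active user is visited first
```

Against a depth-first crawl that treats all users alike, weighting revisit frequency by predicted activity spends bandwidth where new posts are most likely, which is the mechanism behind the coverage and timeliness gains reported in the abstract.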
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year conferred]: 2013
[CLC number]: TP393.092
Article ID: 1911548
Link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1911548.html