Research and Application of an Online Weibo Collection Method Supporting Cloud Computing
Published: 2018-01-01 06:18
Keywords: research and application of an online Weibo collection method supporting cloud computing. Source: Yanshan University, 2014 master's thesis. Type: degree thesis
Related topics: web crawler, distributed systems, Weibo information, Hadoop, MapReduce
[Abstract]: The arrival of Web 2.0 has not only changed how we use the traditional Internet but also ushered in a new era of the Web. As a leading representative of social networks and the mobile Internet, Sina Weibo has more than 500 million registered users; this huge user base and the massive data it generates every day have given shape to a genuinely bidirectional, new-media era. This thesis addresses the online collection of Weibo data. After analyzing the limitations of traditional web crawlers and the strengths and weaknesses of existing research and designs at home and abroad, it proposes a Weibo crawler design that supports cloud-computing extension, built on HTTP packet analysis, distributed computing, and the Hadoop Distributed File System (HDFS). The main contributions are as follows. First, the thesis analyzes the state of the art and the limitations of online data collection from Web 2.0 applications, and proposes logging in to Weibo by simulating a browser, solving the problem of information being inaccessible behind the login wall; after examining the shortcomings of the existing OAuth-authorized Weibo API approach, it adopts a simulated-browser crawler for online Weibo data collection. Second, given the enormous volume of data Weibo produces, and after assessing the risk of reworking the crawling and parsing functions of the traditional crawler inside the Nutch search-engine framework, the thesis proposes a distributed Weibo crawler architecture based on distributed-computing principles and details the core business logic between its modules. Third, the distributed Weibo crawler is further extended with two working modes: a normal mode and a cloud-computing extension mode. In normal mode, Web information extraction is performed with regular expressions and the XML retrieval interface provided by the BeautifulSoup framework; the extension mode adds support for HDFS, emitting the collected data as key-value pairs and writing resource replicas to HDFS, which in effect provides the file input for the MapReduce computing framework. Finally, the modules above are implemented and verified.
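The simulated-browser login described in the abstract comes down to two things: sending browser-like request headers and persisting the session cookies the server returns, so later requests are authenticated. A minimal stdlib sketch of that idea (header values and function names are illustrative assumptions, not the thesis's actual code; the real Weibo login flow is more involved and changes over time):

```python
import http.cookiejar
import urllib.request

def make_browser_opener():
    """Build a URL opener that behaves like a logged-in browser:
    it sends browser-style headers and keeps cookies across requests."""
    jar = http.cookiejar.CookieJar()  # persists login cookies between requests
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Browser-like headers so the server serves the full page instead of
    # redirecting the crawler to a login wall.
    opener.addheaders = [
        ("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"),
        ("Accept-Language", "zh-CN,zh;q=0.9"),
    ]
    return opener, jar

# After POSTing the login form once (endpoint omitted here), the cookies
# accumulated in `jar` make every later opener.open(...) call authenticated.
```

This is the key difference from the OAuth/API route the thesis rejects: the crawler sees the same HTML a browser sees, with no API rate quotas on individual endpoints.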
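Normal-mode extraction, per the abstract, combines regular expressions with BeautifulSoup's retrieval interface. Since BeautifulSoup is a third-party package, this self-contained sketch shows only the regular-expression half, run on a made-up page fragment (the `WB_text` class name and the markup are assumptions about Weibo's HTML, for illustration only):

```python
import re

# Hypothetical fragment of a crawled Weibo page (markup and class name assumed).
SAMPLE_HTML = (
    '<div class="WB_text">Hadoop handles distributed storage</div>'
    '<div class="WB_text">MapReduce handles the computation</div>'
)

# Non-greedy pattern: capture the text inside each post container.
POST_RE = re.compile(r'<div class="WB_text">(.*?)</div>', re.S)

def extract_posts(html):
    """Return the text of every post found in the page."""
    return [text.strip() for text in POST_RE.findall(html)]

# extract_posts(SAMPLE_HTML)
# → ['Hadoop handles distributed storage', 'MapReduce handles the computation']
```

In practice a parser such as BeautifulSoup is more robust than raw regular expressions against nesting and attribute-order changes, which is presumably why the thesis uses both.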
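The extension mode's key-value output can be pictured as serializing each collected post into one tab-separated `key<TAB>value` line, the plain-text record format that Hadoop Streaming mappers read from HDFS input files. A sketch under assumed field names (`user_id`, `text` are illustrative, not the thesis's schema):

```python
def to_kv_records(posts):
    """Serialize posts as 'key<TAB>value' lines for Hadoop Streaming input.

    Tabs and newlines inside the value are flattened to spaces so each
    record stays on exactly one line."""
    lines = []
    for post in posts:
        key = post["user_id"]
        value = post["text"].replace("\t", " ").replace("\n", " ")
        lines.append(f"{key}\t{value}")
    return "\n".join(lines)
```

Once such a file is copied to HDFS (e.g. with `hdfs dfs -put`), it can serve directly as the input of a MapReduce job, which matches the abstract's claim that the extension mode "provides the file input for the MapReduce computing framework".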
[Degree-granting institution]: Yanshan University
[Degree level]: Master's
[Year conferred]: 2014
[CLC number]: TP393.092