Research on Dynamic Page Data Collection Methods and Their Distributed Implementation

Published: 2019-06-22 19:20

[Abstract]: With the rapid development of Web 2.0, dynamic pages with embedded JavaScript account for a growing share of the Internet, which makes page data collection considerably harder. In research on online public opinion and search engines, static pages remain the main target of page data collection, but the need to collect data from dynamic pages is increasingly urgent. Building on a study of common script parsing environments, the Hadoop distributed computing environment, and the principles of distributed web crawlers, this thesis proposes a scheme for building a script parsing environment in a distributed setting. The scheme embeds the script parsing environment in a distributed web crawler to collect data from dynamic pages. The distributed design has three parts: script parsing task scheduling, script parsing environment construction, and script parsing implementation. The task scheduling part studies the scheduling algorithms commonly used in Hadoop and, on that basis, determines the MapReduce scheduling algorithm for script parsing tasks. The environment construction part first designs the script parsing flow and extraction algorithm according to the order in which a browser script engine executes script fragments and the forms in which script fragments are embedded in a page; it then proposes a design for binding common browser DOM objects to the Rhino script engine, which completes the script parsing environment. The implementation part embeds the script parsing environment in the distributed web crawler, designs the environment's overall file architecture and data storage format, and implements each of its submodules in MapReduce. Finally, a Hadoop distributed computing environment is set up and the distributed web crawler with the embedded script parsing environment is tested to verify the scheme's practicality for dynamic page data collection. The test data show that the scheme is an effective way to obtain hyperlink URLs and collect main page content from dynamic pages, and that it broadens the range of pages from which data can be collected.
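The abstract says the extraction algorithm follows the order in which a browser executes script fragments and the forms in which they are embedded in a page, but it does not give the algorithm itself. The following is only a minimal sketch of that idea, not the thesis's method: it assumes four embedding forms (external `src` scripts, inline `<script>` bodies, `on*` event-handler attributes, and `javascript:` links) and simply records them in document order. The class name `ScriptFragmentExtractor`, the function `extract_script_fragments`, and the fragment labels are hypothetical, and Python's standard `html.parser` stands in for whatever parser the thesis actually used.

```python
from html.parser import HTMLParser


class ScriptFragmentExtractor(HTMLParser):
    """Collect script fragments from a page in document order (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []       # (kind, payload) tuples, in document order
        self._in_script = False   # True while inside an inline <script> element
        self._buffer = []         # text chunks of the current inline script

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            src = attrs.get("src")
            if src:
                # External script: record its address; fetching is out of scope here.
                self.fragments.append(("external", src))
            else:
                self._in_script = True
                self._buffer = []
            return
        for name, value in attrs.items():
            if value is None:
                continue
            if name.startswith("on"):
                # Event-handler attribute such as onload or onclick.
                self.fragments.append(("handler", value))
            elif name == "href" and value.lower().startswith("javascript:"):
                # Script embedded in a link target.
                self.fragments.append(("url", value[len("javascript:"):]))

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            code = "".join(self._buffer).strip()
            if code:
                self.fragments.append(("inline", code))
            self._in_script = False


def extract_script_fragments(html_text):
    """Return the (kind, payload) fragments found in html_text, in document order."""
    parser = ScriptFragmentExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.fragments
```

In a pipeline like the one the abstract outlines, fragments collected this way would be handed to a script engine (Rhino in the thesis) that has the needed DOM objects bound to it; that step is not shown here.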
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP393.092

[References]