Design and Implementation of a Multi-Source Download System Based on Web Crawler Technology

Published: 2018-06-09 01:26

Topic: web crawler + HTTP; source: Beijing University of Posts and Telecommunications, master's thesis, 2011

【Abstract】: With the spread of the Internet and rising living standards, more and more people download resources from the Internet. Downloading a resource today involves many cumbersome steps, which is inefficient, and existing download tools are cluttered with advertisements; a careless click can leave the user's computer frozen or infected with malware. To address these problems, this thesis designs and implements a small, lightweight, easy-to-use program that integrates search, storage and display, and download. It not only supplies a large number of downloadable URLs but also raises the download rate. The thesis first introduces web crawler technology and the Hypertext Transfer Protocol (HTTP), and then extends the traditional web crawler. A traditional crawler captures only static URLs and misses the many dynamic URLs hidden in the deep web, so it loses many of the more valuable URLs. The result is low download efficiency and too few URLs to support multi-source downloading. This thesis resolves dynamic URLs in the deep web by executing JavaScript. The scripts are executed with the Rhino engine, which has two shortcomings: first, Rhino cannot emulate the browser's built-in objects; second, it cannot resolve properties and methods that are added dynamically to those objects. The thesis addresses both: adding support for DOM operations lets Rhino emulate the browser's built-in objects, and changing how lookups are performed on those objects lets Rhino resolve their dynamically added properties and methods. The improved Rhino can extract more URLs. The storage and display module stores and displays downloadable URLs in groups according to a fixed rule: only URLs with the same file type and the same file size are shown in one group. The display module uses a timed refresh mechanism.
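The grouping rule described above (same file type and same file size) can be sketched as follows. This is only an illustration, not the thesis's implementation: the `Candidate` record, the `group_candidates` function, and the example URLs are hypothetical names introduced here, and the sketch assumes the file type and size are already known for each URL (for instance from response headers).

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    """A downloadable URL plus the metadata used for grouping (hypothetical record)."""
    url: str
    file_type: str  # e.g. a file extension
    size: int       # file size in bytes


def group_candidates(candidates):
    """Group URLs so that every group shares one (file_type, size) pair."""
    groups = defaultdict(list)
    for c in candidates:
        # Normalise the type so that "ZIP" and "zip" fall into the same group.
        groups[(c.file_type.lower(), c.size)].append(c.url)
    return dict(groups)


# Made-up example data: two URLs agree on type and size, the third differs in size.
candidates = [
    Candidate("http://a.example/setup.zip", "zip", 1048576),
    Candidate("http://b.example/dl?id=7", "ZIP", 1048576),
    Candidate("http://c.example/setup.zip", "zip", 2097152),
]
groups = group_candidates(candidates)
```

A shared type and size only makes two URLs candidates for the same file; it does not prove that they serve identical content.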
The download module uses multi-source downloading. It first obtains the grouped URLs from the storage and display module. After the user clicks the download area, the URL group the user selected is checked precisely: only URLs that really point to the same file are used as source addresses for the multi-source download. The check downloads the fragment at the same position from each of these URLs and computes the MD5 value of each fragment; only URLs whose MD5 values match are used as source addresses.
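The MD5 check described above might look roughly like the sketch below. It rests on several assumptions that the abstract does not state: fragments are fetched with an HTTP `Range` request, a source that fails or ignores the range is simply skipped, and when several different digests appear the largest matching set is kept. The names `fetch_range` and `select_sources` are hypothetical.

```python
import hashlib
import urllib.request
from collections import defaultdict


def fetch_range(url, start, length, timeout=10):
    """Fetch `length` bytes starting at offset `start` with an HTTP Range request."""
    headers = {"Range": f"bytes={start}-{start + length - 1}"}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # A server that ignores Range replies 200 with the whole body;
        # only a 206 Partial Content reply is the fragment that was requested.
        if resp.status != 206:
            raise ValueError(f"{url} did not honour the Range request")
        return resp.read()


def select_sources(urls, start, length, fetch=fetch_range):
    """Return the largest set of URLs whose fragment at `start` has the same MD5.

    `fetch` is injectable so the comparison logic can be exercised without a network.
    """
    by_digest = defaultdict(list)
    for url in urls:
        try:
            fragment = fetch(url, start, length)
        except (OSError, ValueError):
            continue  # unreachable or unsuitable source: leave it out
        by_digest[hashlib.md5(fragment).hexdigest()].append(url)
    if not by_digest:
        return []
    # Assumption: when digests disagree, keep the most common one.
    return max(by_digest.values(), key=len)
```

Comparing a single fragment is a cheap filter, not a proof of identity; sampling more than one offset would lower the chance that two different files happen to agree on the sampled bytes.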
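【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year conferred】: 2011
【Classification number】: TP391.3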

【Related literature】