Design and Implementation of a Web Crawler for Vertical Search

Published: 2018-02-15 10:56

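Keywords: vertical search engine; web crawler; Heritrix; ChangyouSpider; asynchronous loading. Source: Beijing University of Posts and Telecommunications, master's thesis, 2013. Document type: degree thesis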

[Abstract]: With the rapid growth of the Internet, the volume of data online is increasing faster than most people imagine, and demand for data and information keeps rising. Search engines help people retrieve what they need from massive amounts of data, so they have become an indispensable everyday tool and have even changed people's memory habits. An analysis of today's mainstream search engines, including Google, Baidu, Yahoo, Bing, and Sogou, shows that they all consist of three main parts: a web crawler, an index, and a front-end search interface. This three-part division is also the one recognized in industry. As users' requirements for information rise, however, general-purpose search no longer meets specialized needs, because its results are large in volume but lack depth. Vertical search has grown rapidly in response, and its specialization and depth within a particular domain have made it popular with Internet users. A powerful search engine depends on data, and that data is obtained mainly by web crawlers, so the crawler is critical to the search engine. This thesis targets vertical search engines and builds the Changyou system, which provides three services: general search, book search, and video search. First, the architecture of the Changyou system is designed according to campus network conditions; it aggregates the three services behind a single entry point and allows seamless access from different network segments. Second, current mainstream open-source web crawlers are surveyed, Heritrix is chosen as the crawler prototype based on the system's requirements, and its source code is analyzed. On this basis, Heritrix is heavily customized and the problem of crawling asynchronously loaded content is solved. Then, based on Heritrix's runtime behavior and characteristics, ChangyouSpider is designed and implemented: a lightweight, efficient crawler suited to incremental crawling for vertical search engines, which makes up for Heritrix's shortcomings. Heritrix and ChangyouSpider are used together as the crawler for the vertical search engine. Finally, tests on the crawled data covering comprehensiveness, impurity rate, crawling of asynchronously loaded content, and page validity verify the crawler's functionality and performance.
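[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3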
