
Design and Implementation of a Topic-Oriented Multi-Threaded Web Crawler

Published: 2018-04-01 11:00

  Keywords: web crawler; topic crawler (focused crawler). Source: master's thesis, Northwest Minzu University (西北民族大学), 2017.


[Abstract]: A web crawler is a program that automatically fetches web page content and usually serves as a core component of a search engine, crawling pages from the Internet. In recent years, the rapid development of the Internet has caused an explosive growth of online information; to retrieve the required information quickly and accurately from this ocean of data, a general-purpose crawler is no longer adequate, which is why the topic crawler (also known as a focused crawler) emerged. A topic crawler applies a page analysis algorithm to filter out URLs unrelated to the topic, keeps only the links that meet the requirements, and then fetches and stores the corresponding pages, providing resources for subsequent querying and retrieval. This thesis first reviews the development of web crawlers and related technologies and analyzes the key techniques of topic crawlers. Focusing on the shortcomings of general-purpose crawlers, it analyzes the working principles and supporting technologies of a multi-threaded topic crawler and presents its workflow and overall design, covering the basic functional architecture, the page-crawling modules, the front-end display modules, the database design, and the overall design of the system interface. For topic relevance judgment, page content is handled with a vector space model: the text of a page is represented as a vector, a similarity measure is defined over these vectors, and the similarity between the content and the topic can then be computed; the content-based Fish-Search algorithm is adopted for this purpose. For URL handling, the link-analysis-based PageRank algorithm is used, where the scores derived from the quantity and quality assumptions evaluate the importance of a page. Combining the two algorithms yields the overall topic relevance score, which keeps downloaded pages relevant to the topic, effectively avoids "topic drift", and maintains both precision and recall. For multi-threading, the Python thread pool used in this thesis is well suited to IO-bound tasks and effectively improves crawling efficiency.
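To make the content-based relevance judgment concrete, the following is a minimal sketch of the vector space model comparison described in the abstract: a page and the topic description are each turned into term-frequency vectors and compared by cosine similarity. The function names, the regex tokenizer, and the threshold value are illustrative assumptions rather than details from the thesis; real Chinese pages would also need word segmentation (for example with jieba) before vectorization.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse bag-of-words term-frequency vector."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_text, threshold=0.1):
    """Keep a page only if its content is similar enough to the topic description."""
    return cosine_similarity(to_vector(page_text), to_vector(topic_text)) >= threshold
```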
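The link-analysis side can be illustrated with a basic PageRank power iteration. The graph format (a dict mapping each page to its outgoing links), the damping factor of 0.85, and the fixed iteration count are assumptions made for this sketch, not the thesis's actual implementation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Estimate page importance from link structure by power iteration."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if links:
                share = rank[page] / len(links)  # each out-link gets an equal share
                for target in links:
                    if target in new_rank:
                        new_rank[target] += damping * share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy example: page "c" is linked by both "a" and "b", so it ends up ranked highest.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```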
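Finally, a minimal sketch of the multi-threaded download step: page fetching is IO-bound, so a thread pool lets many requests wait on the network concurrently, which is why the abstract notes that a Python thread pool suits this workload. The worker count, the use of urllib, and the example URLs are illustrative assumptions, not necessarily what the thesis implementation uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its URL together with the page size in bytes."""
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())


def crawl(urls, workers=8):
    """Fetch a batch of URLs concurrently with a small thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:             # skip pages that time out or fail
                print(f"failed: {futures[future]} ({exc})")
    return results


if __name__ == "__main__":
    print(crawl(["https://example.com/", "https://example.org/"]))
```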
[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year awarded]: 2017
[CLC classification]: TP393.092; TP391.3

Article ID: 1695247


Permalink: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1695247.html

