
Design and Implementation of a Topic-Oriented Multi-Threaded Web Crawler

Published: 2018-04-01 11:00

  Keywords: web crawler; topic crawler (focused crawler). Source: master's thesis, Northwest Minzu University (西北民族大学), 2017.


[Abstract]: A web crawler is a program that automatically fetches web page content and usually serves as a core component of a search engine, crawling pages from the Internet. In recent years, the rapid development of the Internet has caused an explosive growth of online information; to retrieve the required information quickly and accurately from this ocean of data, a general-purpose crawler is no longer adequate, which is why the topic crawler (also known as a focused crawler) emerged. A topic crawler applies a page analysis algorithm to filter out URLs unrelated to the topic, keeps only the links that meet the requirements, and then fetches and stores the corresponding pages, providing resources for subsequent querying and retrieval. This thesis first reviews the development of web crawlers and related technologies and analyzes the key techniques of topic crawlers. Focusing on the shortcomings of general-purpose crawlers, it analyzes the working principles and supporting technologies of a multi-threaded topic crawler and presents its workflow and overall design, covering the basic functional architecture, the page-crawling modules, the front-end display modules, the database design, and the overall design of the system interface. For topic relevance judgment, page content is handled with a vector space model: the text of a page is represented as a vector, a similarity measure is defined over these vectors, and the similarity between the content and the topic can then be computed; the content-based Fish-Search algorithm is adopted for this purpose. For URL handling, the link-analysis-based PageRank algorithm is used, where the scores derived from the quantity and quality assumptions evaluate the importance of a page. Combining the two algorithms yields the overall topic relevance score, which keeps downloaded pages relevant to the topic, effectively avoids "topic drift", and maintains both precision and recall. For multi-threading, the Python thread pool used in this thesis is well suited to IO-bound tasks and effectively improves crawling efficiency.
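To make the content-based relevance judgment concrete, the following is a minimal sketch of the vector space model comparison described in the abstract: a page and the topic description are each turned into term-frequency vectors and compared by cosine similarity. The function names, the regex tokenizer, and the threshold value are illustrative assumptions rather than details from the thesis; real Chinese pages would also need word segmentation (for example with jieba) before vectorization.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse bag-of-words term-frequency vector."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_text, threshold=0.1):
    """Keep a page only if its content is similar enough to the topic description."""
    return cosine_similarity(to_vector(page_text), to_vector(topic_text)) >= threshold
```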
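The link-analysis side can be illustrated with a basic PageRank power iteration. The graph format (a dict mapping each page to its outgoing links), the damping factor of 0.85, and the fixed iteration count are assumptions made for this sketch, not the thesis's actual implementation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Estimate page importance from link structure by power iteration."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if links:
                share = rank[page] / len(links)  # each out-link gets an equal share
                for target in links:
                    if target in new_rank:
                        new_rank[target] += damping * share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy example: page "c" is linked by both "a" and "b", so it ends up ranked highest.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```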
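Finally, a minimal sketch of the multi-threaded download step: page fetching is IO-bound, so a thread pool lets many requests wait on the network concurrently, which is why the abstract notes that a Python thread pool suits this workload. The worker count, the use of urllib, and the example URLs are illustrative assumptions, not necessarily what the thesis implementation uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its URL together with the page size in bytes."""
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())


def crawl(urls, workers=8):
    """Fetch a batch of URLs concurrently with a small thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:             # skip pages that time out or fail
                print(f"failed: {futures[future]} ({exc})")
    return results


if __name__ == "__main__":
    print(crawl(["https://example.com/", "https://example.org/"]))
```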
[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year awarded]: 2017
[CLC classification]: TP393.092; TP391.3

Article ID: 1695247


Permalink: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1695247.html

