Design and Implementation of a Crawler Log Data Information Extraction and Statistics System
Published: 2018-05-26 01:06
Topics: information extraction + crawler indicator data statistics; Source: Beijing University of Posts and Telecommunications, master's thesis, 2012
【Abstract】: With the explosive growth of information on the web, people rely more and more heavily on search engines. The crawler is an indispensable part of a search engine, and the quality of the pages it fetches directly determines the quality of the whole engine's search results. Even if retrieval, indexing, and the other related components work perfectly, there can be no real user experience if most of what the crawler collects is junk pages. The crawler's scheduling and fetching strategies therefore have to be adjusted according to the observed fetching results. How, then, can the quality and effectiveness of the crawler's page fetching be evaluated? That is the problem the crawler log data information extraction and statistics system presented in this thesis sets out to solve.

The work of this thesis is as follows:

1. The crawler writes logs during seed merging/scheduling and page downloading, and these log files are scattered across every node of the cluster on which the crawler is deployed. This thesis collects the crawler log data from each node, merges, archives, and compresses it, uploads the resulting compressed files to the distributed file system HDFS, and finally builds index files for the compressed files (a sketch of the upload step is given after the abstract).

2. For a distributed crawler cluster that downloads between 800 million and more than a billion URLs per day, the daily crawler logs amount to at least several hundred GB, and the compressed files uploaded to HDFS each day come to roughly 150 GB; a single machine cannot process data at this scale. This thesis therefore takes information extraction as its technical foundation and Hadoop as its computing platform: Hive is used to structure the crawler log data, HQL statements turn the statistical indicators the crawler team cares about into jobs submitted to the Hadoop cluster, and the indicator results computed by MapReduce are finally imported into a MySQL database (a sketch of such an HQL indicator job also follows the abstract).

3. Finally, the lightweight PHP framework CI (CodeIgniter) is used to display the crawler indicator data imported into MySQL as web pages and to send the reports by e-mail.

Experimental results show that, with the crawler's log data as the data source and Hadoop and Hive as the massive-data processing platform, the system can extract the useful information within the available time and provide reliable data support for adjusting the crawler's strategy.
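The thesis does not publish its collection scripts, so the following is only a minimal sketch of the upload step described in item 1, assuming a daily log archive has already been merged and compressed on a node and that the cluster runs a stock Hadoop HDFS. The paths, file names, and NameNode address are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrawlerLogUploader {
    public static void main(String[] args) throws Exception {
        // Hypothetical locations: a merged-and-compressed daily crawler log on the
        // local disk of a node, and the target per-day directory on HDFS.
        Path localArchive = new Path("/data/crawler/archive/crawler-log-2012-03-01.tar.gz");
        Path hdfsDailyDir = new Path("/crawler/logs/2012-03-01/");

        Configuration conf = new Configuration();
        // Address of the HDFS NameNode; adjust to the actual cluster
        // (older Hadoop releases use the key "fs.default.name" instead).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        if (!fs.exists(hdfsDailyDir)) {
            fs.mkdirs(hdfsDailyDir);          // create the per-day directory once
        }
        // Copy the compressed archive from the local file system into HDFS.
        fs.copyFromLocalFile(localArchive, hdfsDailyDir);
        fs.close();
        System.out.println("Uploaded " + localArchive + " to " + hdfsDailyDir);
    }
}
```

Building the index files for the compressed archives (the last step of item 1) is not shown; conceptually it amounts to recording, per archive, whatever offsets or file lists are needed to locate individual log blocks later.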
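Item 2 can likewise be illustrated with a minimal sketch of how one statistical indicator might be expressed as an HQL statement and submitted through the Hive JDBC interface. The table name crawler_log, its columns (host, http_status, dt), and the HiveServer2 endpoint are assumptions for illustration, not the schema used in the thesis; a 2012-era deployment would more likely use the original HiveServer driver (org.apache.hadoop.hive.jdbc.HiveDriver) with a jdbc:hive:// URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CrawlerIndicatorJob {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; requires hive-jdbc and its dependencies on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Hypothetical indicator: per-host fetch volume and HTTP-200 count for one day.
        // Hive compiles this HQL into MapReduce jobs and runs them on the Hadoop cluster.
        String hql =
              "SELECT host, COUNT(*) AS fetched, "
            + "       SUM(CASE WHEN http_status = 200 THEN 1 ELSE 0 END) AS ok_pages "
            + "FROM crawler_log "
            + "WHERE dt = '2012-03-01' "
            + "GROUP BY host";

        ResultSet rs = stmt.executeQuery(hql);
        while (rs.next()) {
            System.out.printf("%s\t%d\t%d%n",
                    rs.getString(1), rs.getLong(2), rs.getLong(3));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```

In the system described by the thesis, the aggregated rows would then be exported to MySQL for the CodeIgniter front end; a JDBC batch insert or a tool such as Sqoop are common ways to perform that export.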
【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year awarded】: 2012
【CLC classification number】: TP391.3
Article ID: 1935352
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1935352.html