Research on Web Crawler Technology Based on the Hadoop Platform
Topic: web crawler + Hadoop; Source: master's thesis, Nanjing University of Posts and Telecommunications, 2017
[Abstract]: The rapid development of the Internet has brought explosive growth of online content, and this sheer volume of information poses a great challenge to retrieving what a user actually needs. Facing massive information-retrieval and personalized-search demands, improving the efficiency and accuracy of web information search has become a key problem that urgently needs to be solved, and web crawler technology is an important component of web search. Since a single machine can hardly complete a task of this scale, the Hadoop cloud platform is used to provide distributed computation and storage, and an improved web crawler runs on Hadoop so that information is fetched efficiently and accurately. Based on an in-depth study of the Hadoop cloud platform and web crawler technology, the thesis identifies shortcomings of existing topic crawling algorithms and improves on them: it optimizes feature word extraction, improves relevance computation with a semantic tree, and proposes a topic crawling algorithm that ranks links by optimized weights; the algorithm is run as MapReduce jobs on the cloud platform, improving both its efficiency and its accuracy. For link deduplication, an improved algorithm based on the Bloom filter is proposed: on top of an optimized Bloom filter storage structure, links are layered by attribute to form a hierarchical Bloom filter tree that deduplicates links quickly and accurately; processed on the cloud platform, it improves performance and time-space efficiency, yielding a more effective and more precise link deduplication algorithm.
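The semantic-tree relevance computation itself is not reproduced on this page; as a rough illustration of how a focused (topic) crawler scores page-to-topic relevance, a plain term-frequency cosine similarity could be sketched as below. All names are invented for the sketch, and the thesis's semantic tree would replace the bag-of-words vectors:

```python
import math
from collections import Counter

def cosine_relevance(page_terms, topic_terms):
    """Score how relevant a page is to a topic by building
    term-frequency vectors and taking their cosine similarity."""
    p, t = Counter(page_terms), Counter(topic_terms)
    dot = sum(p[w] * t[w] for w in set(p) & set(t))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in t.values())))
    return dot / norm if norm else 0.0

# A topic crawler keeps pages scoring above a threshold and
# prioritizes their out-links by the score (illustrative only).
score = cosine_relevance(
    ["hadoop", "crawler", "mapreduce", "hdfs"],
    ["hadoop", "mapreduce", "distributed"],
)
```

In a real topic crawler the weighting of link-ranking features would also feed into the frontier's priority queue, which is the part the thesis optimizes.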
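The hierarchical Bloom-filter deduplication is likewise only described at a high level. A minimal sketch, assuming the layering "attribute" is the URL's host, might keep one small Bloom filter per host so that each lookup touches only its own partition; sizes, hash counts, and class names here are illustrative, not the thesis's actual design:

```python
import hashlib
from urllib.parse import urlparse

class BloomFilter:
    """Fixed-size Bloom filter with k hash positions derived
    from salted MD5 digests (illustrative, not tuned)."""
    def __init__(self, size_bits=8192, k=4):
        self.size, self.k = size_bits, k
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

class LayeredURLDeduper:
    """Partition URLs by host, one Bloom filter per partition,
    so filters stay small and lookups stay local."""
    def __init__(self):
        self.layers = {}

    def seen(self, url):
        # Route the URL to its host's filter, creating it lazily.
        bf = self.layers.setdefault(urlparse(url).netloc, BloomFilter())
        if url in bf:
            return True          # possibly a false positive, never a miss
        bf.add(url)
        return False
```

Partitioning by attribute keeps each filter's false-positive rate low without growing one global filter; the thesis additionally arranges the layers as a tree, which this flat sketch omits.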
On the basis of studying the principles of a Hadoop web crawler system, the system is constructed, with detailed design and implementation of its web page download module, web document parsing module, and link processing module; the proposed improved algorithms are applied in the implementation of these key functional modules. On top of the constructed system, experiments verify the proposed improvements, and the results show they are feasible and effective in raising algorithm performance and efficiency.
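The three modules named above (download, parse, link processing) can be caricatured as a single fetch loop. This is a toy, single-threaded sketch with hypothetical names, not the thesis's distributed Hadoop implementation; the fetch function is injected so the download module can be a real HTTP client or a stub:

```python
import re
from collections import deque

def parse_links(html):
    # Parsing module: extract href values. A real parser would
    # resolve relative URLs and normalize them before queueing.
    return re.findall(r'href="([^"]+)"', html)

def crawl(seed, fetch, max_pages=10):
    """Drive the download -> parse -> link-processing cycle.
    `fetch(url) -> html` is supplied by the caller."""
    frontier, seen, pages = deque([seed]), {seed}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        pages[url] = fetch(url)              # download module
        for link in parse_links(pages[url]): # parsing module
            if link not in seen:             # link processing: dedup
                seen.add(link)
                frontier.append(link)
    return pages
```

In the thesis's setting, the `seen` set would be the hierarchical Bloom filter and the loop body would run as distributed tasks rather than in one process.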
[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master
[Year conferred]: 2017
[CLC number]: TP393.092; TP391.1
Link to this article: https://www.wllwen.com/guanlilunwen/ydhl/1954977.html