Research on Web Crawler Technology Based on the Hadoop Platform
Topic: web crawler + Hadoop; Source: master's thesis, Nanjing University of Posts and Telecommunications, 2017
[Abstract]: The rapid development of the Internet has brought explosive growth of online content, and this sheer volume of information poses a great challenge to retrieving what a user actually needs. Facing massive information-retrieval and personalized-search demands, improving the efficiency and accuracy of web information search has become a key problem that urgently needs to be solved, and web crawler technology is an important component of web search. Since a single machine can hardly complete a task of this scale, the Hadoop cloud platform is used to provide distributed computation and storage, and an improved web crawler runs on Hadoop so that information is fetched efficiently and accurately. Based on an in-depth study of the Hadoop cloud platform and web crawler technology, the thesis identifies shortcomings of existing topic crawling algorithms and improves on them: it optimizes feature word extraction, improves relevance computation with a semantic tree, and proposes a topic crawling algorithm that ranks links by optimized weights; the algorithm is run as MapReduce jobs on the cloud platform, improving both its efficiency and its accuracy. For link deduplication, an improved algorithm based on the Bloom filter is proposed: on top of an optimized Bloom filter storage structure, links are layered by attribute to form a hierarchical Bloom filter tree that deduplicates links quickly and accurately; processed on the cloud platform, it improves performance and time-space efficiency, yielding a more effective and more precise link deduplication algorithm.
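The semantic-tree relevance computation itself is not reproduced on this page; as a rough illustration of how a focused (topic) crawler scores page-to-topic relevance, a plain term-frequency cosine similarity could be sketched as below. All names are invented for the sketch, and the thesis's semantic tree would replace the bag-of-words vectors:

```python
import math
from collections import Counter

def cosine_relevance(page_terms, topic_terms):
    """Score how relevant a page is to a topic by building
    term-frequency vectors and taking their cosine similarity."""
    p, t = Counter(page_terms), Counter(topic_terms)
    dot = sum(p[w] * t[w] for w in set(p) & set(t))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in t.values())))
    return dot / norm if norm else 0.0

# A topic crawler keeps pages scoring above a threshold and
# prioritizes their out-links by the score (illustrative only).
score = cosine_relevance(
    ["hadoop", "crawler", "mapreduce", "hdfs"],
    ["hadoop", "mapreduce", "distributed"],
)
```

In a real topic crawler the weighting of link-ranking features would also feed into the frontier's priority queue, which is the part the thesis optimizes.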
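The hierarchical Bloom-filter deduplication is likewise only described at a high level. A minimal sketch, assuming the layering "attribute" is the URL's host, might keep one small Bloom filter per host so that each lookup touches only its own partition; sizes, hash counts, and class names here are illustrative, not the thesis's actual design:

```python
import hashlib
from urllib.parse import urlparse

class BloomFilter:
    """Fixed-size Bloom filter with k hash positions derived
    from salted MD5 digests (illustrative, not tuned)."""
    def __init__(self, size_bits=8192, k=4):
        self.size, self.k = size_bits, k
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

class LayeredURLDeduper:
    """Partition URLs by host, one Bloom filter per partition,
    so filters stay small and lookups stay local."""
    def __init__(self):
        self.layers = {}

    def seen(self, url):
        # Route the URL to its host's filter, creating it lazily.
        bf = self.layers.setdefault(urlparse(url).netloc, BloomFilter())
        if url in bf:
            return True          # possibly a false positive, never a miss
        bf.add(url)
        return False
```

Partitioning by attribute keeps each filter's false-positive rate low without growing one global filter; the thesis additionally arranges the layers as a tree, which this flat sketch omits.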
On the basis of studying the principles of a Hadoop web crawler system, the system is constructed, with detailed design and implementation of its web page download module, web document parsing module, and link processing module; the proposed improved algorithms are applied in the implementation of these key functional modules. On top of the constructed system, experiments verify the proposed improvements, and the results show they are feasible and effective in raising algorithm performance and efficiency.
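The three modules named above (download, parse, link processing) can be caricatured as a single fetch loop. This is a toy, single-threaded sketch with hypothetical names, not the thesis's distributed Hadoop implementation; the fetch function is injected so the download module can be a real HTTP client or a stub:

```python
import re
from collections import deque

def parse_links(html):
    # Parsing module: extract href values. A real parser would
    # resolve relative URLs and normalize them before queueing.
    return re.findall(r'href="([^"]+)"', html)

def crawl(seed, fetch, max_pages=10):
    """Drive the download -> parse -> link-processing cycle.
    `fetch(url) -> html` is supplied by the caller."""
    frontier, seen, pages = deque([seed]), {seed}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        pages[url] = fetch(url)              # download module
        for link in parse_links(pages[url]): # parsing module
            if link not in seen:             # link processing: dedup
                seen.add(link)
                frontier.append(link)
    return pages
```

In the thesis's setting, the `seen` set would be the hierarchical Bloom filter and the loop body would run as distributed tasks rather than in one process.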
[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master
[Year conferred]: 2017
[CLC number]: TP393.092; TP391.1
Link to this article: https://www.wllwen.com/guanlilunwen/ydhl/1954977.html