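Research and Optimization of a Distributed Search Engine Based on Nutch
Published: 2018-07-10 20:24
Topic: Nutch + indexing; Source: Wuhan University of Technology, master's thesis, 2013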
[Abstract]: Cloud computing has become one of the most closely watched topics in both the computer industry and academia, and Hadoop, currently the most popular cloud computing platform, is being used ever more widely. At the same time, the open-source search engine package Nutch not only provides the tools a search engine needs but is also highly extensible. A growing number of researchers are studying the combination of Hadoop and Nutch, trying in various ways to improve the performance of distributed search. Building on their results, this thesis studies and optimizes a distributed search engine based on Nutch and Hadoop.
The research covers three areas: the Nutch framework, the Hadoop distributed platform, and the principles of distributed crawling. First, the Nutch framework and the Hadoop distributed platform are analyzed, with a detailed examination of their architecture and main working principles, including core technologies such as the indexing mechanism, the plug-in mechanism, HDFS, and Map/Reduce. Next, the thesis focuses on crawler technology, distributed crawling in particular. After analyzing the existing Nutch-based crawling mechanism, it starts from a change to the data structure and introduces an extensible hash function into the task assignment algorithm, which resolves the load-balancing and low-efficiency problems of the original algorithm.
On this basis, the thesis designs a distributed search system based on Nutch and Hadoop. The index module of the system uses the extensible hash function, and the index and search modules take advantage of Nutch's extensibility: by integrating ICTCLAS, the Chinese lexical analysis system developed by the Chinese Academy of Sciences, the system effectively improves Nutch's support for Chinese.
Finally, the system is evaluated on a cluster built in the laboratory through functional tests, performance tests, and a comprehensive assessment from several angles. The test results verify the feasibility and scalability of the designed system and confirm the improvement in its performance.
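The abstract does not spell out the extensible hash function used for task assignment, so the following is only an illustrative sketch and not the thesis's algorithm. It assumes a consistent-hash ring, a common way to spread crawl tasks across nodes so that adding a node moves only a small share of the hosts. The class `HashRing`, the helper `_hash`, the node names, and the `replicas` parameter are all hypothetical.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping URL hosts to crawler nodes."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node, smooths the load
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place `replicas` virtual points for the node on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, url):
        """Return the node that owns the first ring point after the host's hash."""
        host = urlsplit(url).hostname or url
        idx = bisect.bisect(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["node-1", "node-2", "node-3"])
    urls = [f"http://site{i}.example.com/index.html" for i in range(1000)]
    before = {u: ring.node_for(u) for u in urls}
    ring.add_node("node-4")
    moved = sum(1 for u in urls if ring.node_for(u) != before[u])
    print(f"{moved} of {len(urls)} hosts moved after adding node-4")
```

Hashing on the host rather than the full URL keeps all pages of one site on the same node, which makes per-site politeness limits easier to enforce. Whether the thesis partitions by host or by URL is not stated in the abstract.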
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
Article ID: 2114599
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2114599.html