Research and Application of a Hadoop-Based Search Engine

Published: 2018-04-27 00:17

Topic: search engine + Hadoop; source: Zhejiang Sci-Tech University, master's thesis, 2013

[Abstract]: With the widespread adoption of network information technology, users' requirements for information retrieval have become increasingly strict. Fast, accurate, and comprehensive search can bring institutions of all kinds high customer satisfaction and good commercial returns. Constrained by technical and financial resources, most small and medium-sized institutions cannot build a dedicated, efficient search system around user needs the way large institutions do, nor can they customize one further to their own requirements. How to make effective use of the technology of the existing search engine giants to provide powerful search computing and diversified services to more institutions, especially small and medium-sized enterprises, universities, and research institutes that hold a certain amount of data but have limited budgets and weak core development capability, has therefore become a focus and a difficulty of current search research.
Driven by practical application needs, this thesis studies the principles, related technologies, and algorithms of a Hadoop-based distributed search engine. It analyzes in depth the distributed computing framework MapReduce and the distributed file system HDFS, introduces a concrete design based on the MapReduce programming model, integrates the BM25 ranking model into Lucene for retrieval scoring, and uses the Paoding analyzer for Chinese word segmentation. It completes the system's architecture design on the Hadoop platform, defines the division of system functions, analyzes and designs the crawling, indexing, and retrieval workflows, and completes the improvement and implementation of the three subsystems.
Building on an analysis, evaluation, and summary of the needs and existing shortcomings of small and medium-sized institutions in achieving efficient information search, the thesis integrates the three relatively independent subsystems, completes the Hadoop framework setup and related configuration, and deploys a three-node distributed search engine system. Finally, starting from the search needs of users at small and medium-sized institutions, the system's performance is tested and evaluated. Using the Zhejiang Sci-Tech University website as the test subject, the efficiency of web crawling and indexing is examined on the three-node distributed platform and in a single-machine environment. The measured crawling and indexing times show that, for 20,000 web pages, the cluster takes about 15.64% less time than the single machine, and the gap widens as the number of pages grows. In addition, by comparing the match quality of retrieval results at different page counts, the average accuracy of the Hadoop-based distributed search engine is calculated to be nearly 20% higher than in the single-machine environment. The experimental results show that, once an institution's page count grows beyond a certain level, this distributed search engine for small and medium-sized institutions returns more precise results faster than a traditional centralized search engine, with improved security, stability, and scalability, thereby improving the retrieval effectiveness of small and medium-sized institutions and accelerating their informatization.
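The abstract names BM25 as the ranking model used for retrieval scoring but gives no formula or parameters. The following is only a minimal sketch of BM25 scoring, not the thesis's implementation: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, and the smoothed IDF form `log(1 + (N - df + 0.5) / (df + 0.5))` are all assumptions chosen for illustration. Documents are assumed to be already tokenized; in the described system that step would be handled by the Chinese word segmenter.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs`.

    k1 and b are assumed defaults, not values taken from the thesis.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant); always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["hadoop", "search", "engine"],
    ["hadoop", "hdfs"],
    ["lucene", "index"],
]
print(bm25_scores(["search", "engine"], docs))
```

With equal term frequency, the length normalization favors shorter documents, so a query for `hadoop` ranks the two-token document above the three-token one in this toy corpus.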

[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
