Research and Design of a Hadoop-Based Distributed Vertical Search Engine

Published: 2018-08-07 14:59

[Abstract]: As the Internet has grown and network technology has matured, the number of websites and the volume of information online have become enormous. As network resources and the number of users seeking information grow ever faster, traditional general-purpose search engines show several weaknesses: limited coverage, large and cluttered result sets, long update cycles, and query ambiguity. At the same time, information is becoming more diverse and retrieval needs differ widely between users, so general-purpose search engines can no longer serve those needs in a targeted way. In addition, most commercially successful search engines use a centralized architecture, which places high performance demands on a single server, is prone to failure, and scales poorly. Distributed vertical search emerged in response to these shortcomings, aiming for good performance, fault tolerance, easy scaling, fine-grained and accurate classification, comprehensive and in-depth data, and timely updates. "Distributed" means that multiple servers form a cluster and coordinate their work. "Vertical search" means specialized search for a particular industry; it is characterized by being "specialized, precise, and deep", has a distinct industry focus, and is a subdivision and extension of general-purpose search. This work builds a distributed cluster with Hadoop, analyzes the source code of the open-source search components Nutch and Solr, and studies search engine theory and key technologies. Building on existing academic work, it makes some improvements to topic relevance judgment and to the ranking of retrieved web pages, and it uses domain ontology knowledge to construct a steel-domain ontology library that expands user queries so that information can be located and retrieved more precisely. Finally, a prototype distributed vertical search engine is designed and implemented on Hadoop by modifying the source code of the open-source search components, and its search results are compared with those of the commercial search engine Baidu. Analysis and evaluation of the experimental results show that the system has a clear topical focus and achieves higher precision than the general-purpose search engine.
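The ontology-based query expansion mentioned in the abstract can be pictured with a minimal sketch. The abstract gives no details of the actual ontology or expansion rules, so everything here is an assumption made for illustration: the toy `STEEL_ONTOLOGY` dictionary, its entries, and the helper names `expand_query` and `to_boolean_query` are hypothetical and are not taken from the thesis. The sketch only shows the general idea of adding related domain terms to a user's query and joining them with OR.

```python
from typing import Dict, List

# Hypothetical toy ontology: each concept maps to directly related terms
# (synonyms or narrower concepts). The entries are illustrative only and
# do not come from the thesis.
STEEL_ONTOLOGY: Dict[str, List[str]] = {
    "steel": ["carbon steel", "stainless steel", "alloy steel"],
    "stainless steel": ["austenitic stainless steel"],
}


def expand_query(query: str, ontology: Dict[str, List[str]]) -> List[str]:
    """Return the normalized query term followed by its directly related terms."""
    term = query.strip().lower()
    expanded = [term]
    for related in ontology.get(term, []):
        if related not in expanded:
            expanded.append(related)
    return expanded


def to_boolean_query(terms: List[str]) -> str:
    """Join terms into a single OR query, quoting each term as a phrase."""
    return " OR ".join(f'"{t}"' for t in terms)


if __name__ == "__main__":
    terms = expand_query("steel", STEEL_ONTOLOGY)
    print(to_boolean_query(terms))
```

A term that is missing from the ontology is returned unchanged, so the expansion step never narrows the original query. A real system would likely also weight the added terms lower than the user's own term, which this sketch does not attempt.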
[Degree-granting institution]: Hebei University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3
