Research and Design of a Distributed Vertical Search Engine Based on Hadoop
Published: 2018-08-07 14:59
[Abstract]: As the Internet has grown and network technology has matured, the number of websites and the volume of information online have become enormous. Network resources and the number of users keep growing quickly, and against that growth traditional general-purpose search engines show several problems: limited coverage, result lists that are large and cluttered, long update cycles, and query ambiguity.
At the same time, information is becoming more diverse and the retrieval needs of different users differ widely, so general-purpose search engines can no longer serve those needs in a targeted way. In addition, most commercially successful search engines use a centralized architecture, which places high demands on the performance of a single server, is prone to failure, and scales poorly. Distributed vertical search emerged in response to these shortcomings, aiming at good performance, fault tolerance, easy expansion, fine-grained and precise classification, comprehensive and in-depth data, and timely updates.
"Distributed" means that multiple servers form a cluster and coordinate with one another. "Vertical search" is specialized search for a particular industry; it is characterized as "specialized, precise, and deep," has distinct industry features, and is a subdivision and extension of general-purpose search engines. This thesis builds a distributed cluster with Hadoop and analyzes the source code of the open-source search components Nutch and Solr. It then studies search engine theory and key technologies and, drawing on existing academic work, makes some improvements to topic relevance judgment and web page retrieval ranking. Domain ontology knowledge is used to build a steel-domain ontology library and to expand user queries, so that information can be located and retrieved more precisely. Finally, the source code of the open-source components is modified to design and implement a prototype distributed vertical search engine on Hadoop. Its search results are compared with those of the commercial search engine Baidu; analysis and evaluation of the experimental results show that the system has a clear topical focus and a higher precision than the general-purpose search engine.
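The ontology-based query expansion mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation, which modifies Nutch and Solr; the `TOY_ONTOLOGY` mapping, its entries, and the `expand_query` function are hypothetical names introduced here only to show the general idea of adding ontology-related terms to a user's query.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: each term maps to related terms
# (for example synonyms or narrower concepts). The entries are
# illustrative only and are not taken from the thesis.
TOY_ONTOLOGY: Dict[str, List[str]] = {
    "steel": ["carbon steel", "alloy steel"],
    "plate": ["sheet"],
}


def expand_query(query: str, ontology: Dict[str, List[str]]) -> List[str]:
    """Return each query term followed by its ontology-related terms,
    skipping duplicates and preserving first-seen order."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for term in query.lower().split():
        # Keep the original term first, then append its related terms.
        for candidate in [term, *ontology.get(term, [])]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query("steel plate", TOY_ONTOLOGY))
```

In a real system the expanded terms would typically be weighted lower than the original terms before being sent to the index, so that expansion broadens recall without drowning out what the user actually typed; that weighting is omitted here.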
[Degree-granting institution]: Hebei University of Technology
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP391.3
Article ID: 2170395
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2170395.html