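Design and Implementation of a Crawler-Based Sohu News Search Engine
Published: 2018-01-23 12:29
Keywords: search engine, ranking algorithm, Lucene, PageRank, Hadoop Source: Sun Yat-sen University, master's thesis, 2012 Document type: degree thesis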
[Abstract]: Information on the Internet is growing at a remarkable rate, and search engine technology has become a focus of attention for Internet users who need to find useful information quickly in massive amounts of data. The news search engine described in this thesis was developed against this background.
For ordinary users, commercial search engines largely meet their needs. For specific users, however, such as small and medium-sized enterprises or research institutions, commercial Internet search engines cannot fully satisfy their requirements, because the information they return is poorly targeted and they cannot be configured on demand. Open-source software such as Lucene meets this need well: because it is fully open source, developers can build search engines tailored to a specific domain according to their requirements. The system in this thesis is designed and implemented on the basis of open-source software.
The thesis first introduces the history, trends, and classification of search engines. It then presents the system requirements analysis, identifying the functional and non-functional requirements, goes on to design the system framework and the related system architecture, and finally designs and implements each functional module in detail.
The system is a crawler-based search engine for Sohu news. Built by extending existing open-source components (secondary development), it implements a Heritrix data crawling module, an HTMLParser data preprocessing module, a module that generates the Lucene index and the Oracle database data, and a Lucene core search processing module. To improve the user experience, the thesis combines the Lucene text matching algorithm with the PageRank algorithm, takes into account the influence of time on a news search engine, and proposes an improved page ranking algorithm. On this basis, it designs and implements a scheme for realizing the algorithm using Lucene together with Hadoop distributed storage and distributed computing, so that the search results presented to users are more accurate and more reasonable.
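The abstract does not state the formula of the improved ranking algorithm, so the following is only a minimal sketch of how a text-match score, a PageRank value, and a time factor might be combined. Everything specific here is an assumption for illustration: the weighted linear combination, the exponential half-life decay, the default weights, the function names, and the expectation that the text score and PageRank value have already been normalized to a comparable range. It runs in memory on a single machine and does not reproduce the Lucene or Hadoop implementation described in the thesis.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by dangling pages (no outlinks) is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def time_decay(age_hours, half_life_hours=24.0):
    """Assumed time factor: halves every `half_life_hours` (hypothetical)."""
    return 0.5 ** (age_hours / half_life_hours)


def combined_score(text_score, page_rank, age_hours,
                   w_text=0.6, w_rank=0.2, w_time=0.2):
    """Assumed weighted sum of text relevance, link rank, and freshness.

    The weights are illustrative placeholders, not values from the thesis.
    """
    return (w_text * text_score
            + w_rank * page_rank
            + w_time * time_decay(age_hours))


if __name__ == "__main__":
    # Hypothetical three-page link graph.
    ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
    fresh = combined_score(0.8, ranks["a"], age_hours=2)
    stale = combined_score(0.8, ranks["a"], age_hours=72)
    print(ranks)
    print(fresh, stale)
```

With identical text relevance and link rank, the more recent article receives the higher combined score, which is the behavior a news search engine would want from a time factor. In a real deployment the weights and half-life would need tuning against relevance judgments.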
[Degree-granting institution]: Sun Yat-sen University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3;TP311.52
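[References]
Related journal articles (first 2)
1 Duan Huaichuan; Hu Ping; An improved PageRank algorithm based on topic features and a time factor [J]; Computer Engineering and Design; 2010, No. 04
2 Wang Chunhua; Zhu Junping; An improved PageRank algorithm with non-uniform weight propagation [J]; Computer Engineering and Design; 2010, No. 10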
Article ID: 1457538
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1457538.html