Design and Implementation of a Crawler-Based Sohu News Search Engine

Published: 2018-01-23 12:29

Keywords: search engine ranking algorithm, Lucene, PageRank, Hadoop　Source: Sun Yat-sen University, 2012 master's thesis　Type: degree thesis

【Abstract】: The volume of information on the Internet is growing at a remarkable pace, and search engine technology has become a focus of attention for Internet users who need to find useful information quickly in massive amounts of data. The news search engine described in this thesis was developed against that background. Commercial search engines largely meet the needs of ordinary users. For specific users, however, such as small and medium-sized enterprises or research institutions, commercial search engines are not targeted enough and cannot be configured on demand, so the needs of these users are not fully met. Open-source software such as Lucene fills this gap well: because it is fully open source, developers can build search engines tailored to a specific domain. The system in this thesis is designed and implemented on top of open-source software. The thesis first reviews the history, trends, and classification of search engines, then presents the requirements analysis, identifying the functional and non-functional requirements. It then designs the system framework and architecture, and finally designs and implements each functional module in detail. The system is a crawler-based Sohu news search engine. Building on existing components through secondary development, it implements a Heritrix data crawling module, an HTMLParser data preprocessing module, a module that generates the Lucene index and the Oracle database data, and a Lucene core search module. To improve the user experience, the thesis combines Lucene's text matching algorithm with the PageRank algorithm, takes into account the effect of time on a news search engine, and proposes an improved page ranking algorithm. On that basis, it designs and implements a scheme for running the algorithm using Lucene together with Hadoop distributed storage and distributed computing, so that the search results presented to users are more accurate and more reasonable.
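The abstract does not give the formula of the improved ranking algorithm, so the following is only a minimal sketch of how a text-match score, a PageRank value, and a time factor might be combined into one ranking score. The linear weighting, the exponential freshness decay, the default weights, the 24-hour half-life, and the names `combined_score` and `rank_documents` are all assumptions made for illustration and are not taken from the thesis. Both input scores are assumed to be pre-normalized to the range [0, 1].

```python
def combined_score(text_score, pagerank, age_hours,
                   w_text=0.6, w_rank=0.3, w_time=0.1,
                   half_life_hours=24.0):
    """Combine three signals into one ranking score (illustrative only).

    text_score      -- text-match relevance, assumed normalized to [0, 1]
    pagerank        -- link-based importance, assumed normalized to [0, 1]
    age_hours       -- hours since the news item was published
    w_*             -- hypothetical weights; not values from the thesis
    half_life_hours -- hypothetical half-life of the freshness signal
    """
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    # Freshness halves every `half_life_hours`: 1.0 when new, 0.5 after one half-life.
    freshness = 0.5 ** (age_hours / half_life_hours)
    return w_text * text_score + w_rank * pagerank + w_time * freshness


def rank_documents(docs):
    """Return the documents sorted by combined score, best first.

    Each document is a dict with the keys 'text_score', 'pagerank'
    and 'age_hours' (a hypothetical record layout).
    """
    return sorted(
        docs,
        key=lambda d: combined_score(d["text_score"], d["pagerank"], d["age_hours"]),
        reverse=True,
    )
```

With this weighting, two articles that match a query equally well and have equal PageRank values are ordered by recency, which is the behavior a news search engine would typically want. In a distributed setup like the one the abstract describes, the PageRank values would be computed offline and only looked up at query time.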

【Degree-granting institution】: Sun Yat-sen University
【Degree level】: Master's
【Year degree granted】: 2012
【Classification number】: TP391.3;TP311.52

【References】