Research and Implementation of a Vertical Search Engine for Tourism Information Based on Nutch and Solr
Published: 2018-07-12 11:12
Topic of this thesis: vertical search engine + tourism information; Reference: Hainan University, master's thesis, 2016
[Abstract]: With the rapid development of the Internet, the World Wide Web has become the carrier of a vast amount of information, and search engines, as an important tool for obtaining and using that information, have become users' entry point and guide to the Web. Traditional general-purpose search engines collect data from the whole Web indiscriminately; their coverage is comprehensive, but they return an overwhelming number of results, which raises the filtering cost for users with specific needs. A vertical search engine collects only pages related to one particular domain, so users can obtain information in the domain they care about more precisely and quickly. A vertical search engine for the tourism domain lets travelers, tourism practitioners, and other interested parties obtain tourism information quickly. Nutch is an open-source Java web crawler under the Apache umbrella; it is mainly used to collect web page data and then analyze the crawled pages, and combined with the open-source full-text indexing framework Solr it can be used to build a prototype search engine system. Building on a study of Nutch and Solr, this work modifies their relevant functional modules and improves the relevant algorithms to implement a vertical search engine for the tourism domain. The main research content of this thesis is as follows:
(1) First, the research background and significance are established, and the working principles of search engines, their history, and their two classification schemes are reviewed. The shortcomings of general-purpose search engines and the advantages of vertical search engines are described. Then, after the key points of vertical search engines are analyzed, a topic-focused crawler model for tourism information is proposed.
(2) The most notable difference between a vertical search engine and a general-purpose one is the topical focus of the collected content. A tourism topic lexicon is built from a number of sample documents using document frequency (DF) combined with manual screening. During crawling, a topic relevance algorithm is applied together with the lexicon to judge each page's topical relevance, and pages that are only weakly related to tourism are filtered out.
(3) IK-Analyzer is introduced into the indexing process to strengthen the search engine's support for Chinese word segmentation; its dictionary is extended with the contents of the topic lexicon, and its stop-word list is expanded. The quality of the page-ranking algorithm is closely tied to the user's query experience. For search ranking, page scoring is improved by combining the PageRank algorithm with topic relevance, so that ranking takes both page authority and topicality into account.
(4) Drawing on the UI design of the major search engines, a user-friendly search interface is designed and implemented to improve the user experience.
(5) After an in-depth study of the working principles and source code of Nutch and Solr, the author proposes original ideas and solutions for topic-focused collection in the tourism domain and carries out secondary development on both systems to implement a tourism information vertical search engine based on Nutch and Solr. A Hadoop distributed platform is set up on servers, and the system is deployed there for running and testing.
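Points (2) and (3) of the abstract describe three steps: building a topic lexicon from document frequency, filtering pages by topic relevance, and blending PageRank with topic relevance for ranking. The abstract does not give the formulas the thesis uses, so the following Python sketch is illustrative only. The function names, the term-fraction relevance measure, the threshold, and the linear blend with weight `alpha` are all assumptions made for this example, not the author's method. In the real system the tokens would come from IK-Analyzer; here the pages are hypothetical pre-tokenized word lists.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def document_frequency(docs: Iterable[Sequence[str]]) -> Counter:
    """For each term, count how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): each document counts a term once
    return df


def candidate_lexicon(docs, min_df: int, stop_words=frozenset()) -> Set[str]:
    """Terms with DF >= min_df, minus stop words.

    These are only candidates; the abstract says the final lexicon is
    also screened manually.
    """
    df = document_frequency(docs)
    return {t for t, n in df.items() if n >= min_df and t not in stop_words}


def topic_relevance(tokens: Sequence[str], lexicon: Set[str]) -> float:
    """Assumed measure: fraction of a page's tokens found in the lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)


def is_on_topic(tokens, lexicon, threshold: float) -> bool:
    """Keep a page only if its relevance reaches the (assumed) threshold."""
    return topic_relevance(tokens, lexicon) >= threshold


def combined_score(pagerank: float, relevance: float, alpha: float = 0.5) -> float:
    """Assumed linear blend of page authority and topicality."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * pagerank + (1.0 - alpha) * relevance


# Hypothetical sample documents, already tokenized.
samples = [
    ["hotel", "beach", "ticket", "the"],
    ["hotel", "scenic", "ticket", "the"],
    ["beach", "hotel", "island", "the"],
]
lexicon = candidate_lexicon(samples, min_df=2, stop_words={"the"})

travel_page = ["cheap", "hotel", "near", "beach"]
finance_page = ["stock", "market", "news", "hotel"]
kept = [p for p in (travel_page, finance_page) if is_on_topic(p, lexicon, 0.3)]
```

With these sample inputs only `travel_page` passes the filter. In a Nutch-based system, logic of this kind would normally live in crawler and scoring plugins rather than in a standalone script; the sketch only shows the data flow.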
[Degree-granting institution]: Hainan University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.3
Article ID: 2116971
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2116971.html