
Design and Implementation of a Distributed Online Travel Search Crawler System

Published: 2018-05-04 12:21

  Topics: search engine + online travel; Source: Beijing University of Posts and Telecommunications, master's thesis, 2013


[Abstract]: With the rapid development of Internet technology and the tourism industry, and in particular with rising living standards and the growth of online travel in recent years, more and more users book travel routes online. As the number of online travel route web pages has grown sharply, online travel search engines have become an important research direction in search engine development.

This thesis first introduces the research background and significance of a distributed online travel search crawler system and the current state of web crawler research. Drawing on the working principles of search engines and on the techniques and strategies of distributed web crawlers, it then analyzes in detail the key technologies the system requires, focusing on the distributed task allocation strategy and its granularity, URL deduplication, and the update strategy for online travel route pages. Based on the characteristics of travel route pages, it also proposes a discrimination algorithm designed specifically for identifying online travel route pages.

Building on these technologies and strategies, the thesis designs and implements a distributed online travel search crawler system motivated by users' need to search for online travel route pages. The system collects travel route information from travel and vacation platform websites and from ordinary travel agency websites. In the system design, the system is divided by function into four main modules: the control server, the crawler server, the index and retrieval server, and the database module. The structure of each module is described in detail, and class diagrams are given.
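The abstract names distributed task allocation and URL deduplication as key techniques but does not say how the thesis implements them. As a minimal sketch under stated assumptions, the Python below shows one common approach: assigning URLs to crawler nodes by hashing the host name (site-level granularity) and filtering already-seen URLs with a Bloom filter. The names `assign_node` and `BloomFilter`, the filter size, and the hash count are all hypothetical, and the thesis itself was implemented in Java.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url, node_count):
    """Map a URL to a crawler node index by hashing its host name.

    Assumption: site-level granularity, so every page of one site
    is handled by the same node.
    """
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 20, hash_count=5):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive hash_count bit positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item):
        """Insert item; return True if it was (possibly) already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


# Usage: keep only URLs not seen before, then route each to a node.
seen_urls = BloomFilter()
urls = [
    "http://a.example/route/1",
    "http://a.example/route/2",
    "http://a.example/route/1",  # duplicate, filtered out
]
fresh = [u for u in urls if not seen_urls.add(u)]
plan = {u: assign_node(u, node_count=5) for u in fresh}
```

Hashing by host keeps per-site politeness and link locality on one node, at the cost of uneven load when a few sites dominate. A Bloom filter trades a small false-positive rate (some new URLs are wrongly skipped) for a fixed memory footprint.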
Finally, the thesis describes in detail the implementation of the control server and the crawler server. The whole system is implemented in Java, with Tomcat + Apache + MySQL as the development environment.

To verify the feasibility of the distributed crawler system, a test environment was built on five servers, and functional and performance tests were carried out. Tests of the travel route page discrimination algorithm show that it can effectively determine whether a web page is an online travel route page, with an accuracy of about 90%. The running tests show that the system, whether run on a single server or as a whole, collects online travel route page information stably and efficiently and builds an inverted index on route titles, so that users can conveniently retrieve the travel route information they need through a web-based graphical interface. The system meets its design goals and has practical value for the informatization of the tourism industry.
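The abstract states that an inverted index is built on route titles but gives no detail. The following Python sketch illustrates the general idea only: each term maps to the set of route ids whose title contains it, and a query returns routes matching every term. The function names, the sample titles, and the whitespace tokenizer are assumptions; Chinese titles would need a word segmenter instead.

```python
from collections import defaultdict


def tokenize(title):
    """Lowercase and split on whitespace (assumption: space-delimited text)."""
    return title.lower().split()


def build_inverted_index(titles):
    """Build term -> set of route ids from a dict of {route_id: title}."""
    index = defaultdict(set)
    for route_id, title in titles.items():
        for term in tokenize(title):
            index[term].add(route_id)
    return index


def search(index, query):
    """Return ids of routes whose title contains every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Usage with made-up route titles.
routes = {
    1: "Beijing 5-day tour",
    2: "Sanya beach tour",
    3: "Beijing Great Wall day trip",
}
index = build_inverted_index(routes)
matches = search(index, "beijing tour")
```

Indexing only titles keeps the index small and fast to rebuild, though queries that match words appearing only in a page body will miss.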
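[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3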

Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1843041.html