
Design and Implementation of a Web Anti-Crawling System

Published: 2018-04-03 06:39

  Topic: anti-crawling. Angle: web crawlers. Source: Harbin Institute of Technology, master's thesis, 2015


[Abstract]: The company studied here is a leading online travel platform in China, and its flight-ticket search and transaction platform is one of its key infrastructure platforms. The platform's search covers more than 180,000 routes worldwide and can query more than 4,000 travel-agency websites in real time; in 2014 its ticket transactions exceeded 80 million tickets. As business volume has kept growing, however, the platform and its related business systems have come under pressure from information crawling by a variety of external sources. The large volume of crawl requests raises three serious problems: (1) data security: abnormal crawling exposes key data to the risk of being obtained by competitors; (2) system performance: crawl requests exhaust server resources and seriously degrade users' search and transaction experience; (3) duplicated effort: different business systems each implement their own anti-crawling measures, with uneven quality, which wastes resources. Drawing on an in-depth study of web crawlers and anti-crawling techniques, this thesis designs and implements a Web Anti-Crawling System (ACS). ACS provides a unified, high-quality anti-crawling service for the company's flight-ticket search and transaction platform and the business projects built on it, implementing anti-crawling strategies based on HTTP headers, JS-encrypted strings, IP blacklists, and access-frequency control. Building on a detailed understanding of the platform's business, it also implements a behavior-pattern strategy tied to business logic, which further raises the cost of crawling. In addition, the design of the strategy interface and the anti-crawling service interface separates the API from its implementation, which gives good extensibility and reduces coupling with business systems, making the service easy to integrate. ACS thus offers a solution to the crawling problems described above. Functional and performance testing confirmed that the five strategies described in the thesis work correctly and meet the system's functional requirements; that ACS is loosely coupled to other business systems and easy to integrate; and that during performance testing the system served requests stably and met its performance targets. ACS is now in production use.
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.092
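
The abstract describes a strategy interface kept separate from its implementations, with checks on HTTP headers, IP blacklists, and access frequency. The thesis's actual code is not shown on this page, so the following is only a minimal Python sketch of that idea. All class names (`Request`, `Strategy`, `HeaderStrategy`, `IpBlacklistStrategy`, `FrequencyStrategy`, `AntiCrawlService`), the required header names, and the sliding-window rule are assumptions made for illustration, not the author's design.

```python
from __future__ import annotations

import time
from abc import ABC, abstractmethod
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Request:
    """Hypothetical minimal view of an incoming HTTP request."""
    ip: str
    headers: dict[str, str] = field(default_factory=dict)


class Strategy(ABC):
    """Policy interface: callers depend on this, not on a concrete check."""

    @abstractmethod
    def allows(self, request: Request) -> bool:
        ...


class HeaderStrategy(Strategy):
    """Reject requests missing headers a normal browser would send (assumed list)."""

    def __init__(self, required: Iterable[str] = ("User-Agent", "Referer")):
        self.required = tuple(required)

    def allows(self, request: Request) -> bool:
        return all(request.headers.get(name) for name in self.required)


class IpBlacklistStrategy(Strategy):
    """Reject requests from explicitly blocked addresses."""

    def __init__(self, blocked: Iterable[str]):
        self.blocked = set(blocked)

    def allows(self, request: Request) -> bool:
        return request.ip not in self.blocked


class FrequencyStrategy(Strategy):
    """Sliding window: at most `limit` accepted requests per IP per `window` seconds."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits: dict[str, deque] = defaultdict(deque)

    def allows(self, request: Request) -> bool:
        now = self.clock()
        hits = self.hits[request.ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


class AntiCrawlService:
    """Service facade: runs strategies in order and stops at the first rejection."""

    def __init__(self, strategies: Iterable[Strategy]):
        self.strategies = list(strategies)

    def check(self, request: Request) -> bool:
        return all(s.allows(request) for s in self.strategies)


if __name__ == "__main__":
    service = AntiCrawlService([
        HeaderStrategy(),
        IpBlacklistStrategy({"10.0.0.9"}),
        FrequencyStrategy(limit=2, window=1.0),
    ])
    browser = {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"}
    print(service.check(Request("10.0.0.1", browser)))
```

In this sketch a business system only calls `AntiCrawlService.check`, so a new strategy (for example a behavior-pattern check) can be added by subclassing `Strategy` without changing callers, which is one way to get the low coupling the abstract mentions.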




Article ID: 1704044


Article link: https://www.wllwen.com/guanlilunwen/ydhl/1704044.html

