Research on Web Robot Detection Technology Based on Behavior Patterns

Published: 2018-04-09 07:49

Topic: web crawler detection　Focus: behavior patterns　Source: Wuhan Research Institute of Posts and Telecommunications, 2017 master's thesis

[Abstract]: A Web Robot (web crawler) is a program that automatically retrieves resources of all kinds from the Internet. Since its first formal use in 1993, it has benefited both ordinary users and Internet professionals: it was the arrival of Web Robots that made purposeful retrieval from the ever-growing body of Internet data possible. As Internet technology has developed and become part of every aspect of society, the volume of data online has grown rapidly, and crawler technology has kept evolving to meet users' differing needs. Web Robots are generally divided into general-purpose robots, focused robots, incremental robots, Deep Robots, Topic Robots, and distributed robots. In practice, large crawler systems often combine several of these techniques, which makes their architecture and behavior increasingly complex.
However, the wide use of Web Robots for retrieving network information and resources also brings hidden risks and negative effects. First, Web Robots frequently attempt to fetch all kinds of resources from a website, which degrades the performance of the web server and creates a risk of information leakage. Second, crawler visits are recorded in the site's logs, which in turn affects the difficulty and accuracy of data mining based on those logs. In addition, robots designed for malicious purposes, such as probing for site vulnerabilities or stealing site information, can cause privacy leaks, resource abuse, and similar problems. To address these problems, Internet practitioners have developed many Web Robot detection techniques that let site developers determine whether a client is an ordinary user or a robot program.
To further improve detection and make up for the shortcomings of existing methods, this thesis uses session vectors to describe the behavior patterns of Web Robots and implements a detection algorithm based on their behavioral features. The main work is as follows: it analyzes the design principles and behavior patterns of Web Robots and reviews in detail the strengths and weaknesses of other detection algorithms; it introduces the principle of behavior vectors, the methods for analyzing them, and their applications in various fields; and it designs a Web Robot detection algorithm based on the support vector machine (SVM), analyzes its effectiveness, and tests it experimentally. The contributions are as follows. Based on the behavioral features of web crawlers, cluster analysis is performed on Web logs to extract feature vectors that characterize Web access sessions; this extraction is then improved, and a method for computing feature-vector weights is given together with an improved weight formula. Building on the SVM-based crawler detection algorithm, a behavior-pattern-based crawler detection system is designed and implemented, and its system architecture and module design are described in detail.
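The idea of describing each Web access session as a feature vector can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the `Request` record, the 30-minute session timeout, and the five features chosen (request count, share of image requests, share of HEAD requests, whether `/robots.txt` was fetched, and mean gap between requests). The thesis's own feature set and weight formula are not given in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed session timeout in seconds; a common heuristic, not from the thesis.
SESSION_TIMEOUT = 30 * 60

# Hypothetical set of suffixes treated as image resources.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")


@dataclass
class Request:
    """One parsed web-log entry (hypothetical record layout)."""
    client: str       # client key, e.g. IP address plus user agent
    timestamp: float  # seconds since some epoch
    method: str       # HTTP method, e.g. "GET" or "HEAD"
    path: str         # requested path


def split_sessions(requests):
    """Group requests per client and start a new session after a long gap."""
    by_client = defaultdict(list)
    for req in sorted(requests, key=lambda r: r.timestamp):
        by_client[req.client].append(req)

    sessions = []
    for reqs in by_client.values():
        current = [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur.timestamp - prev.timestamp > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions


def session_vector(session):
    """Map one session to a small numeric feature vector."""
    n = len(session)
    gaps = [b.timestamp - a.timestamp for a, b in zip(session, session[1:])]
    return [
        float(n),                                                   # request count
        sum(r.path.lower().endswith(IMAGE_SUFFIXES) for r in session) / n,  # image share
        sum(r.method == "HEAD" for r in session) / n,               # HEAD share
        1.0 if any(r.path == "/robots.txt" for r in session) else 0.0,
        sum(gaps) / len(gaps) if gaps else 0.0,                     # mean gap in seconds
    ]
```

Vectors of this kind, paired with labels for known robot and human sessions, could then be used to train a support vector machine classifier, for example with scikit-learn's `sklearn.svm.SVC`. That pairing is one plausible reading of the approach, not a reproduction of the thesis's system.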
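[Degree-granting institution]: Wuhan Research Institute of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP393.092;TP391.3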

[References]