HTTP Traffic Analysis in a Multi-Resource-Server Collaborative Environment

Published: 2019-01-26 20:08

[Abstract]: A few years ago, HTTP-based network services were delivered in a centralized manner by a small number of service providers, and distributed servers were rare. Typically, a single server offered a unique network service at a fixed IP address. Today, network structures are increasingly complex, and the relationship between an IP address and the content and services it provides has become dynamic and complicated: operators make heavy use of content delivery networks (CDNs) and content caching, cloud-based network services keep emerging, and the coupling between service providers and the infrastructure hosting their services is weakening. All of this makes network management more difficult. Under these circumstances, operators urgently need to understand the composition and usage patterns of HTTP traffic and to determine how HTTP traffic is distributed among different service providers, so that they can allocate network resources sensibly. At the same time, because of the sharp growth in network traffic, traditional traffic analysis methods can no longer meet the storage and processing requirements of massive data, and more efficient and more reliable processing approaches are needed. Hadoop is a scalable open-source software framework for reliable distributed processing of massive data, and it has been applied in a growing number of research fields. This thesis first introduces an HTTP traffic analysis algorithm based on association rules, which uses the Jaccard coefficient to measure traffic correlation, and gives its mathematical description. It then introduces the basic principles of Hadoop and, building on Hadoop, proposes a three-layer architecture for an HTTP traffic analysis system that integrates the otherwise independent functions of traffic collection, storage, processing, and analysis into a fully functional processing system. Next, the thesis describes in detail the IP address identification component of the system's data layer. This component maps server IP addresses to service providers and is the most important part of the HTTP traffic analysis system described here. Finally, using the intermediate results produced by the collection layer and the data layer, the thesis summarizes the distribution patterns of HTTP traffic at the application layer of the HTTP traffic analysis system.
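The abstract names the Jaccard coefficient as the measure of traffic correlation but does not say which sets are compared. As a minimal sketch only, the example below assumes each server IP is represented by the set of clients seen contacting it; the function name `jaccard`, the dictionary `clients_by_server`, and all addresses and client IDs are hypothetical and not taken from the thesis.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets have similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical observations: which clients contacted which server IP.
clients_by_server = {
    "203.0.113.10": {"c1", "c2", "c3", "c4"},
    "203.0.113.20": {"c2", "c3", "c4", "c5"},
    "198.51.100.7": {"c9"},
}

# Servers sharing most of their clients score high; unrelated servers score 0.
similar = jaccard(clients_by_server["203.0.113.10"], clients_by_server["203.0.113.20"])
unrelated = jaccard(clients_by_server["203.0.113.10"], clients_by_server["198.51.100.7"])
print(similar, unrelated)
```

Under this assumption, a high coefficient between two server IPs would suggest they serve the same user population and might belong to the same service provider; the thesis's actual rule construction may differ.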
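[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP393.06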

[References]