Methods and Implementation of Machine-Learning-Based Network Traffic Identification

Published: 2018-04-21 22:32

Topic: network traffic identification + machine learning; source: Shandong University master's thesis, 2014

[Abstract]: With the rapid development of computer network technology and the arrival of the information age, ever more frequent network use has driven explosive growth in Internet traffic; the steady emergence of new applications has made the use of network communication protocols more flexible and more mixed; and the rise in network viruses, eavesdropping, and malicious attacks has made network security a focus of public and government concern. Network traffic identification offers an effective way to address these problems, and it is therefore receiving increasing attention.
Many traffic identification methods already exist, but from both research and application perspectives the emphasis has shifted to feasibility and effectiveness: how to process massive volumes of data quickly and how to correctly identify the various applications on the network. Facing a constantly changing network environment, this thesis studies network traffic identification methods based on machine learning (ML), focusing on two supervised learning algorithms: the back propagation (BP) neural network and the support vector machine (SVM).
A BP neural network trains on a distributed, parallel network structure, which gives it higher fault tolerance and faster processing. It has good nonlinear mapping ability and can model nonlinear relationships between inputs and outputs. BP neural networks are also trained by searching for a global optimum, which gives them strong generalization ability. SVM is a machine learning method designed for small samples; it maps a low-dimensional sample space nonlinearly into a high-dimensional space through an inner-product kernel function, and it rests on a relatively complete theoretical foundation. Using the "transductive inference" method, SVM can readily solve nonlinear multi-class classification problems. Its optimal separating hyperplane is determined only by a limited number of support vectors on the boundary, which makes the method simple, effective, and robust. Both algorithms can adapt to the large data volumes and diversity of network environments, and both can identify the application type of network traffic quickly and effectively.
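For reference, the standard textbook form of the kernel SVM decision function makes the support-vector statement above concrete. This is the general formulation rather than notation taken from the thesis; the symbols $\alpha_i$, $b$, and $\gamma$ are the usual ones and are assumptions here. The radial basis function is shown because it is the kernel the abstract reports choosing.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b\Big),
\qquad
K(\mathbf{x}_i, \mathbf{x}) = \exp\!\big(-\gamma \,\lVert \mathbf{x}_i - \mathbf{x} \rVert^{2}\big), \quad \gamma > 0
```

Only training samples with $\alpha_i > 0$ (the support vectors) contribute to the sum, which is why the separating hyperplane depends on a limited number of boundary samples.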
The traffic identification system in this thesis targets network flows in the home and is functionally divided into two parts: a home gateway and a backend server. The home gateway captures packets in real time, extracts features, identifies the traffic with machine learning methods, and sends the results to the backend server. The backend server stores the results in a database and displays the application types of the traffic currently on the network, which makes supervision easier for administrators. The main contributions of the thesis are as follows:
1. Based on a study and analysis of network traffic identification and machine learning, the BP neural network was found able to adapt to the large data volumes and diversity of the Internet, and a traffic identification method based on the BP neural network was chosen on this basis. Specifically, a three-layer BP neural network is adopted as the implementation: its classification ability meets the requirements of traffic identification, and its structure is simple and easy to implement. A sigmoid (S-shaped) function is used as the transfer function of the hidden layer to realize a nonlinear mapping of input information such as network flow features. Although a BP neural network easily falls into local minima of the error surface, particle swarm optimization (PSO) is used to find initial weights with globally optimal properties, so that training can reach the global minimum of the error surface. Experimental results show that the PSO-optimized BP neural network quickly finds the global minimum of the error surface and accurately identifies the network application type of the traffic.
2. The principles by which SVM solves linear and nonlinear classification problems were studied carefully, and on this basis a traffic identification method based on SVM is proposed, applying SVM to the field of network traffic identification. A radial basis function is chosen as the SVM kernel to realize the nonlinear mapping from the low-dimensional network flow feature space to a higher-dimensional space. An SVM multi-class classifier is built with the one-against-one method, so that SVM can identify multiple network application types. SVM generates the optimal hyperplane in the high-dimensional space, partitioning the space and classifying the various network applications; because this is a form of global optimization, the SVM-based method generalizes well. Experimental results show that SVM is very well suited to the nonlinear multi-class problem of network traffic identification; it also requires few training samples, has low computational complexity, and can perform identification in real time.
3. A traffic identification system was designed and implemented in a home LAN. Following the system model of machine learning and the implementation approach of supervised learning, an overall architecture for network traffic identification was designed and divided into two parts: real-time online traffic identification and offline training. The process comprises capturing the packets of network flows, generating flow features, selecting training and test sets, training the machine learning algorithms, and testing the classification performance of the two traffic identification algorithms. For the implementation, the BP neural network and SVM traffic identification algorithms were written as programs and ported to the home gateway (which is built on a router). On the backend server's Linux platform, a Web server was set up and a MySQL database installed to support communication between the home gateway and the backend server as well as information processing and storage. Administrators can log in to the backend server through a Web browser to view the traffic identification results for the current home network.
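The capture and feature-generation steps named in the third contribution could be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the `Packet` record, the 5-tuple flow key, and the three features (packet count, mean packet size, duration) are assumptions chosen for the example, since the abstract does not list the features actually used.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; a real gateway would fill it from a capture library."""
    timestamp: float  # seconds since capture start
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str     # e.g. "TCP" or "UDP"
    size: int         # bytes


def flow_key(pkt):
    """Group packets by 5-tuple (one direction only, to keep the sketch short)."""
    return (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)


def extract_flow_features(packets):
    """Return {flow_key: feature dict} with illustrative per-flow statistics."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)

    features = {}
    for key, pkts in flows.items():
        times = [p.timestamp for p in pkts]
        sizes = [p.size for p in pkts]
        features[key] = {
            "packet_count": len(pkts),
            "mean_size": mean(sizes),
            "duration": max(times) - min(times),
        }
    return features


if __name__ == "__main__":
    demo = [
        Packet(0.0, "192.168.1.2", "10.0.0.1", 50000, 80, "TCP", 100),
        Packet(0.5, "192.168.1.2", "10.0.0.1", 50000, 80, "TCP", 300),
        Packet(0.2, "192.168.1.3", "10.0.0.2", 50001, 53, "UDP", 60),
    ]
    for key, feats in extract_flow_features(demo).items():
        print(key, feats)
```

In a pipeline like the one described, each feature dictionary would be converted to a numeric vector and passed to the trained classifier on the gateway; which features are actually informative is an empirical question the abstract leaves open.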

[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.08;TP181

[References]