Research and Implementation of DFI-Based Traffic Classification Technology

Published: 2018-05-05 03:23

Topics: network traffic classification + imbalanced classification; Source: master's thesis, Southeast University, 2015

[Abstract]: Network traffic classification helps Internet service providers guarantee quality of service, and it supports effective supervision and management of the network to keep it secure. With the rapid growth of the Internet, new services keep emerging, and the widespread use of private protocols and encrypted applications is steadily narrowing the range of traffic to which DPI (deep packet inspection) classification applies. The DFI method identifies traffic mainly from the statistical features of flows. It does not need to parse the application-layer payload, processes traffic quickly, remains effective for encrypted packets and private protocols, and requires no additional equipment. DFI-based machine learning traffic classification therefore has good application prospects. However, such methods usually optimize for high overall classification accuracy and ignore the multi-class imbalance of network traffic data, so classification performance tends to favor the majority classes at the expense of the minority classes. In network traffic, some minority classes are heavyweight applications that account for a large share of bytes, and their classification performance affects network planning and bandwidth allocation; other minority classes are command flows, real-time communication flows, and the like, and their classification performance affects communication reliability and quality of service. In addition, because network traffic features change with time and environment, existing classification methods struggle to maintain stable performance, and how to handle concept drift in network traffic is also an active research topic. This thesis addresses these problems and studies traffic classification methods based on Deep Flow Inspection (DFI). The main contents are as follows:

1. The influence of feature selection algorithms on network traffic classification. In DFI-based traffic identification, the choice of metric attributes is especially important. Because the metric attributes contain redundant and irrelevant features, traffic classification incurs high computational and space complexity. A feature selection algorithm can, under a given evaluation strategy, select the attributes that best discriminate between traffic classes, lowering computational and space complexity by reducing the attribute dimension and improving classification and identification accuracy. This thesis proposes a hybrid feature selection algorithm based on selective ensemble and an improved sequential forward search, and compares it with the traditional correlation-based FCBF, the information-gain-based InfoGain and GainRatio, the statistics-based Chi-square, and the consistency-based Consistency methods. The experimental results show that the proposed algorithm better captures the correlation between flow behavior feature attributes and the class attribute.

2. A cost-sensitive algorithm model. Network flow data is class-imbalanced, and current flow classification algorithms mostly favor majority classes and neglect minority classes. To improve minority-class performance, this thesis proposes a cost-sensitive model based on resampling. The imbalanced network traffic data is first resampled with SMOTE to reduce the imbalance between majority and minority classes, and the traffic data is then classified with the AdaCost algorithm, whose cost matrix is a weight-based misclassification cost matrix. Compared with the traditional C4.5 classification algorithm, the experimental results show that the model improves the flow accuracy and byte accuracy of the minority classes.

3. A cost-sensitive multi-classifier algorithm model. Because network traffic features change with time and environment, machine learning classifiers struggle to maintain stable performance. To improve the adaptability of the classifier, this thesis proposes an accuracy-weighted traffic classification method; the experimental results show that it achieves good classification performance and generalization ability when handling concept drift in traffic. To further improve minority-class performance in dynamic environments, the thesis builds on the accuracy-weighted ensemble learning classification method and proposes a cost-sensitive ensemble learning model. The model has two parts: the first is hybrid feature selection, which obtains a stable optimal feature subset; the second combines the accuracy-weighted classification method with the weight-based AdaCost method. The experimental results show that this method effectively improves the flow accuracy and byte accuracy of the minority classes under concept drift.
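The abstract reports both flow accuracy and byte accuracy for the minority classes but does not define them. As a minimal sketch, assume flow accuracy is the fraction of a class's flows that are labeled correctly, and byte accuracy is the fraction of that class's bytes carried by correctly labeled flows. The function name, the class labels WWW and P2P, and the toy data below are illustrative assumptions, not taken from the thesis.

```python
from collections import defaultdict


def per_class_flow_and_byte_accuracy(flows):
    """Compute per-class flow accuracy and byte accuracy.

    flows: iterable of (true_label, predicted_label, byte_count) tuples.
    Returns {label: (flow_accuracy, byte_accuracy)}; both ratios are taken
    over the flows whose true label is `label` (an assumed definition).
    """
    flow_total = defaultdict(int)
    flow_hit = defaultdict(int)
    byte_total = defaultdict(int)
    byte_hit = defaultdict(int)

    for true_label, predicted_label, byte_count in flows:
        flow_total[true_label] += 1
        byte_total[true_label] += byte_count
        if predicted_label == true_label:
            flow_hit[true_label] += 1
            byte_hit[true_label] += byte_count

    return {
        label: (
            flow_hit[label] / flow_total[label],
            # Guard against a class whose flows all carry zero bytes.
            byte_hit[label] / byte_total[label] if byte_total[label] else 0.0,
        )
        for label in flow_total
    }


# Hypothetical toy data: a majority class of small flows and a minority
# class of two large flows, the larger of which is misclassified.
sample_flows = [
    ("WWW", "WWW", 1000),
    ("WWW", "WWW", 1000),
    ("WWW", "WWW", 1000),
    ("WWW", "WWW", 1000),
    ("P2P", "WWW", 900000),
    ("P2P", "P2P", 100000),
]

result = per_class_flow_and_byte_accuracy(sample_flows)
overall_accuracy = sum(1 for t, p, _ in sample_flows if t == p) / len(sample_flows)

for label, (flow_acc, byte_acc) in sorted(result.items()):
    print(f"{label}: flow accuracy={flow_acc:.2f}, byte accuracy={byte_acc:.2f}")
print(f"overall flow accuracy={overall_accuracy:.2f}")
```

On this toy data five of the six flows are labeled correctly, yet the hypothetical P2P class reaches a flow accuracy of only 0.5 and a byte accuracy of 0.1. This is the kind of gap a classifier tuned only for overall accuracy can hide.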

[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.06

[Similar literature]