Research and Implementation of DFI-Based Traffic Classification Technology
Topics: network traffic classification + imbalanced classification; Source: Southeast University, 2015 master's thesis
[Abstract]: Network traffic classification helps Internet service providers guarantee quality of service, and it enables effective supervision and management of the network, which supports network security. With the rapid growth of the Internet, new services keep emerging, and the widespread use of private protocols and encrypted applications is steadily narrowing the range of traffic that deep packet inspection (DPI) can classify. DFI methods identify traffic mainly from the statistical features of flows. They do not parse the application-layer payload, they are fast, they remain effective for encrypted packets and private protocols, and they require no additional equipment. DFI-based machine learning traffic classification therefore has good application prospects. However, these methods usually optimize for overall classification accuracy and ignore the multi-class imbalance of network traffic data, so classification performance tends to favor the majority classes at the expense of the minority classes. Some minority classes in network traffic are heavyweight applications that account for a large share of bytes, and their classification performance bears on network planning and bandwidth allocation. Other minority classes are command flows, real-time communication flows, and the like, and their classification performance bears on communication reliability and quality of service. In addition, because network traffic features change with time and environment, existing classification methods struggle to maintain stable performance, and handling concept drift in network traffic is an active research topic. The work in this thesis centers on these issues and studies traffic classification methods based on Deep Flow Inspection (DFI).
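The gap between counting flows and counting bytes can be made concrete with a small sketch. It is an illustration only, not code from the thesis: the function name `flow_and_byte_accuracy`, the `"__overall__"` key, the class labels, and the byte counts are all hypothetical. It computes, per class and overall, the fraction of flows classified correctly and the fraction of bytes carried by correctly classified flows.

```python
from collections import defaultdict


def flow_and_byte_accuracy(flows):
    """Return {label: (flow_accuracy, byte_accuracy)} plus an "__overall__" entry.

    flows is an iterable of (true_label, predicted_label, byte_count) tuples.
    """
    # Per key: [flows seen, flows correct, bytes seen, bytes correct]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for true_label, predicted_label, byte_count in flows:
        for key in (true_label, "__overall__"):
            c = counts[key]
            c[0] += 1
            c[2] += byte_count
            if predicted_label == true_label:
                c[1] += 1
                c[3] += byte_count
    return {
        key: (c[1] / c[0], c[3] / c[2] if c[2] else 0.0)
        for key, c in counts.items()
    }


# Hypothetical data: 98 small WWW flows classified correctly, and 2 large
# BULK flows (a minority class by flow count) misclassified as WWW.
flows = [("WWW", "WWW", 1_000)] * 98 + [("BULK", "WWW", 1_000_000)] * 2
report = flow_and_byte_accuracy(flows)
print(report["__overall__"])  # high flow accuracy, low byte accuracy
print(report["BULK"])
```

In this made-up data set the classifier gets 98 of 100 flows right, yet the two flows it misses carry most of the bytes, which is why a per-class, byte-aware measure is useful alongside overall accuracy.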
The main contents of the thesis are as follows:
1. Effect of feature selection algorithms on network traffic classification: In current DFI-based traffic identification, the choice of measurement attributes is especially important. Because the measurement attributes contain redundant and irrelevant features, traffic classification incurs high computational and space complexity. A feature selection algorithm uses an evaluation strategy to select the attributes that best discriminate between traffic classes. By reducing the attribute dimension, it lowers computational and space complexity and improves classification and identification accuracy. This thesis proposes a hybrid feature selection algorithm based on selective ensemble and an improved sequential forward search, and compares it with the traditional correlation-based FCBF, the information-gain-based InfoGain and GainRatio, the statistics-based Chi-square, and the consistency-based Consistency methods. Experimental results show that the proposed algorithm better captures the correlation between flow behavior feature attributes and class attributes.
2. A cost-sensitive algorithm model: Network flow data is class-imbalanced, and current flow classification algorithms mostly favor the majority classes and neglect the minority classes. To improve minority-class performance, this thesis proposes a cost-sensitive model based on resampling. The imbalanced traffic data is first resampled with SMOTE to reduce the imbalance between majority and minority classes. The traffic is then classified with the AdaCost algorithm, whose cost matrix is a weight-based misclassification cost matrix. In a comparison with the traditional C4.5 classification algorithm, experimental results show that the model improves the flow accuracy and byte accuracy of the minority classes.
3. A cost-sensitive multi-classifier algorithm model: Because network traffic features change with time and environment, machine learning classification methods struggle to maintain stable performance. To improve the adaptability of the classifier, this thesis proposes an accuracy-weighted traffic classification method. Experimental results show that it achieves good classification performance and generalization ability when handling concept drift in traffic. To further improve minority-class performance in dynamic environments, the thesis builds on the accuracy-weighted ensemble learning method and proposes a cost-sensitive ensemble learning model with two parts. The first part performs hybrid feature selection to obtain a stable optimal feature subset. The second part combines the accuracy-weighted classification method with the weight-based AdaCost method. Experimental results show that the method effectively improves minority-class flow accuracy and byte accuracy under concept drift.
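The general idea behind accuracy-based weighting can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the abstract does not give the weighting formula, so the sketch simply uses each base classifier's accuracy on the most recent labeled chunk as its voting weight. The function names, the toy threshold classifiers, and the `mean_pkt` feature are hypothetical, and the cost-sensitive (AdaCost) part is not modeled.

```python
from collections import defaultdict


def accuracy(classifier, chunk):
    """Fraction of (features, label) pairs in chunk that classifier gets right."""
    if not chunk:
        return 0.0
    correct = sum(1 for features, label in chunk if classifier(features) == label)
    return correct / len(chunk)


def reweight(classifiers, latest_chunk):
    """Weight each base classifier by its accuracy on the newest labeled chunk."""
    return [accuracy(clf, latest_chunk) for clf in classifiers]


def weighted_vote(classifiers, weights, features):
    """Return the label with the largest total weight across base classifiers."""
    scores = defaultdict(float)
    for clf, weight in zip(classifiers, weights):
        scores[clf(features)] += weight
    return max(scores, key=scores.get)


# Two toy base classifiers (hypothetical): one trained before the traffic
# drifted, one trained after. Each thresholds the mean packet size.
def old_model(features):
    return "P2P" if features["mean_pkt"] > 500 else "WWW"


def new_model(features):
    return "P2P" if features["mean_pkt"] > 900 else "WWW"


# Most recent labeled chunk, collected after the drift (made-up values).
latest_chunk = [
    ({"mean_pkt": 700}, "WWW"),
    ({"mean_pkt": 1000}, "P2P"),
    ({"mean_pkt": 300}, "WWW"),
    ({"mean_pkt": 800}, "WWW"),
]

classifiers = [old_model, new_model]
weights = reweight(classifiers, latest_chunk)
print(weights)
print(weighted_vote(classifiers, weights, {"mean_pkt": 750}))
```

Because the weights are recomputed whenever a new labeled chunk arrives, a classifier trained on outdated traffic loses influence without having to be retrained or removed, which is one plausible way an ensemble can adapt to concept drift.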
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP393.06
Article No.: 1845973
Article link: https://www.wllwen.com/guanlilunwen/ydhl/1845973.html