Research and Implementation of Semi-Supervised Network Traffic Classification for Imbalanced Data

Published: 2018-12-20 07:52

[Abstract]: Network traffic classification underpins network service management and control, network security, and network construction, upgrading, and operations, so research on it has substantial practical value. As network technology develops rapidly, the number of users grows sharply, networks expand quickly, and new services keep emerging. The network environment is becoming increasingly complex, and accurate traffic classification is becoming increasingly difficult. In particular, with the widespread use of dynamic port numbers and service encryption, traditional identification methods based on port numbers and payload signature matching have become less effective and less reliable, and researchers have shifted their focus to traffic classification based on machine learning. These methods classify flows by their statistical features, which removes the dependence on port numbers and packet payloads and gives them broader prospects. This thesis studies two key problems in machine-learning-based traffic classification: the sample labeling bottleneck and class imbalance. The main contributions are as follows:
1. To address the sample labeling bottleneck and class imbalance in traffic classification, a semi-supervised traffic classification algorithm based on K-means and k-nearest neighbor is proposed (semi-supervised traffic identification method based on K-means and k-nearest neighbor, KMkNN). The method represents each flow as a high-dimensional vector of flow statistical features and builds a two-stage classifier from the K-means and k-nearest-neighbor algorithms. First, K-means clusters data containing a small number of labeled samples and a large number of unlabeled samples into several clusters. Then the labeled samples in each cluster are used to train a k-nearest-neighbor classifier that classifies the unlabeled samples in that cluster, with the number of neighbors k adjusted adaptively according to the distribution of the labeled samples. This overcomes the tendency of traditional semi-supervised traffic classification methods to favor large classes, which leaves small classes with a low recognition rate or undetected altogether. Theoretical analysis and experimental results both show that, on imbalanced protocol flows, the method improves the recognition rate of small-class flows while keeping a high recognition rate for large-class flows, and that it can discover new applications.
2. Because flow statistical features are redundant and can be divided into several relatively independent feature subsets, an ensemble traffic classification algorithm based on random feature subsets is proposed (ensemble classifier based on random subspace, RSEC). The algorithm first builds the feature set through wrapper-style feature selection based on forward selection, then generates feature subsets by staged random selection, and trains a different base classifier on each feature subset. Finally, it combines the outputs of the base classifiers with a voting scheme that joins absolute-majority and relative-majority (plurality) voting to obtain the ensemble result. Experimental results show that, for both large-class and small-class traffic, the algorithm further improves recognition accuracy and recall over the single classifier KMkNN.
3. Based on a real network environment, an offline traffic classification system based on machine learning is designed and implemented in C#. The system uses Wireshark to capture data online and save it locally for offline analysis. The flow feature set generation module reassembles flows according to their five-tuple information and derives flow features from statistics of the packet headers. The sample labeling module labels training samples by combining port number matching, payload signature matching, and manual labeling. The classification module offers five selectable classification algorithms: C4.5, NBK, semi-supervised K-means, and the KMkNN and RSEC algorithms proposed in this thesis. Finally, the system is tested on real data collected in the laboratory, which verifies its effectiveness.
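The two-stage KMkNN idea summarized in the abstract (K-means first, then a k-nearest-neighbor vote inside each cluster) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which is not reproduced on this page. Two parts are assumptions made only for illustration: the adaptive rule that caps k at the size of the smallest labeled class in a cluster, and the choice to mark clusters with no labeled samples as "unknown" to stand in for the discovery of new applications. The function names `kmeans` and `kmknn` are hypothetical.

```python
import math
from collections import Counter

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's K-means; the first n_clusters points seed the centroids."""
    centroids = [list(p) for p in points[:n_clusters]]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        assign = [min(range(n_clusters), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def kmknn(points, labels, n_clusters, k_max=5):
    """labels[i] is a class name, or None for an unlabeled flow."""
    assign = kmeans(points, n_clusters)
    predicted = list(labels)
    for c in range(n_clusters):
        idx = [i for i, a in enumerate(assign) if a == c]
        labeled = [i for i in idx if labels[i] is not None]
        if not labeled:
            # Assumption: a cluster without labeled flows may be a new application.
            for i in idx:
                predicted[i] = "unknown"
            continue
        # Assumed adaptive rule: k never exceeds the smallest labeled class,
        # so a small class cannot be outvoted purely by class size.
        class_sizes = Counter(labels[i] for i in labeled)
        k = min(k_max, min(class_sizes.values()))
        for i in idx:
            if labels[i] is None:
                nearest = sorted(labeled,
                                 key=lambda j: dist(points[i], points[j]))[:k]
                votes = Counter(labels[j] for j in nearest)
                predicted[i] = votes.most_common(1)[0][0]
    return predicted

# Toy data: two labeled seeds, five unlabeled flows, three well-separated groups.
pts = [(0, 0), (10, 10), (20, 0), (0.1, 0.2), (0.2, 0.1), (10.1, 9.9), (20.1, 0.1)]
lbls = ["web", "p2p", None, None, None, None, None]
print(kmknn(pts, lbls, n_clusters=3))
```

On the toy data, the unlabeled flows near each seed inherit its label, and the group with no labeled flow is reported as "unknown". With a fixed k larger than the small class, a lone small-class sample in a mixed cluster would be outvoted by the large class; capping k is one simple way to avoid that.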
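The RSEC ensemble described in the abstract trains base classifiers on random feature subsets and merges their outputs by voting. The sketch below, in Python, shows one plausible reading of that combination and should not be taken as the thesis's exact procedure. The assumptions are: subsets are drawn by plain uniform sampling (the abstract's staged random selection is not specified here), an absolute majority wins outright, a unique plurality winner is accepted otherwise, and ties fall back to "unknown". All function names are hypothetical.

```python
import random
from collections import Counter

def random_subspaces(n_features, n_subsets, subset_size, seed=0):
    """Draw n_subsets random feature-index subsets (uniform sampling assumed)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_subsets)]

def combine_votes(votes, fallback="unknown"):
    """Return (label, rule) where rule names the voting rule that decided."""
    if not votes:
        return fallback, "tie"
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    # Absolute majority: more than half of all base classifiers agree.
    if top_count * 2 > len(votes):
        return top_label, "absolute"
    # Relative majority: a unique label with the most votes.
    if counts[1][1] < top_count:
        return top_label, "plurality"
    # Assumption: an unresolved tie is reported as the fallback label.
    return fallback, "tie"

def ensemble_predict(classifiers, subspaces, x):
    """Each base classifier sees only its own feature subset of x."""
    votes = [clf([x[i] for i in cols])
             for clf, cols in zip(classifiers, subspaces)]
    return combine_votes(votes)[0]

print(combine_votes(["web", "web", "dns"]))
print(combine_votes(["web", "web", "dns", "p2p", "ftp"]))
print(combine_votes(["web", "dns"]))
```

Here a base classifier is any callable that maps a projected feature vector to a label, so a KMkNN model trained on one subset could be plugged in without changing the voting code.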
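The flow feature generation step mentioned in the abstract (grouping packets by five-tuple and computing statistics from packet headers) might look roughly like the following Python sketch. The thesis system itself is written in C# and its feature set is not listed on this page, so the packet record layout, the unidirectional flow key, and the four statistics chosen here are all illustrative assumptions.

```python
from collections import defaultdict

def flow_features(packets):
    """Group packets into flows by five-tuple and compute header statistics.

    Each packet is assumed to be a tuple:
    (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, length_bytes)
    """
    flows = defaultdict(list)
    for ts, src, dst, sport, dport, proto, length in packets:
        # The five-tuple identifies a (unidirectional) flow.
        flows[(src, dst, sport, dport, proto)].append((ts, length))

    features = {}
    for key, pkts in flows.items():
        times = [t for t, _ in pkts]
        sizes = [n for _, n in pkts]
        features[key] = {
            "packets": len(pkts),                    # packet count
            "bytes": sum(sizes),                     # total bytes
            "mean_size": sum(sizes) / len(sizes),    # mean packet length
            "duration": max(times) - min(times),     # flow duration in seconds
        }
    return features

sample = [
    (0.0, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 100),
    (0.5, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 300),
    (0.2, "10.0.0.3", "10.0.0.2", 5001, 53, "UDP", 60),
]
for five_tuple, stats in flow_features(sample).items():
    print(five_tuple, stats)
```

Each resulting dictionary is the kind of statistical feature vector that a classifier such as KMkNN or RSEC would take as input in place of port numbers or payload contents.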
[Degree-granting institution]: PLA Information Engineering University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP393.06
