Research on Intrusion Detection Using a Resampling-Based Cascade Classifier

Published: 2019-03-11 10:35

[Abstract]: With the rapid development of information technology and the spread of networks, the Internet has become an essential part of work and daily life. At the same time, malicious information theft, personal attacks, and illegal profiteering over the Internet have grown sharply, so network security research has become increasingly important. Intrusion detection, a major research topic in network security, is the process of detecting behavior that violates the security policy of a computer network or system. As information technology advances, the complexity of computer systems grows exponentially, which makes intrusion detection very difficult. A study of network intrusion detection methods shows that common methods focus on raising the overall detection rate while neglecting the detection rate of some important classes. As a result, detection rates are very low for two attack classes: R2L (unauthorized access from a remote host) and U2R (unauthorized access to local superuser privileges). Because a successful attack of either class allows server resources to be stolen or destroyed, improving detection performance for these classes is urgent.

This thesis first analyzes why mainstream detection methods perform poorly on R2L and U2R attacks and identifies two main causes. First, the data distribution is imbalanced, which skews classification; this is an imbalanced classification problem, in which the number of training samples in one or more classes is far larger or far smaller than in the others. Second, these two attack classes are hard to identify from packet headers alone and require detailed information about packet contents. Common intrusion detection methods apply the same method to every class and therefore struggle to achieve good results, whereas cascading multiple classifiers, each handling different classes, can effectively address the imbalanced data distribution in intrusion detection. Since intrusion detection is a typical imbalanced classification problem, the thesis studies resampling and other imbalanced classification methods in depth. Because SMOTE generates noisy and borderline samples when resampling intrusion detection datasets, an NCL (neighborhood cleaning) filter is introduced, and an improved resampling method, SMOTE-NCL, is proposed to filter out the noisy and borderline data.

Given the advantages of cascade classifiers for imbalanced classification and their good results in intrusion detection, the thesis uses a cascade classifier for intrusion detection. To limit the effect of the high feature dimensionality of intrusion detection datasets on detection performance, an improved CGFR feature selection method is introduced to select a separate feature subset for each classifier in the cascade. CGFR and SMOTE-NCL are then applied to the cascade classifier, yielding a resampling-based cascade classifier intrusion detection model intended to improve the poor detection of R2L and U2R attacks by existing methods. Based on theoretical analysis and experiments, the classification methods chosen for the cascade are, respectively, the C4.5 decision tree algorithm and the naive Bayes (NB) algorithm. The first classifier in the cascade is trained on three classes: DoS (denial-of-service attack), Probe (port scanning), and Normal (normal data). The second classifier is trained on Normal, R2L, and U2R. During detection, the test set first passes through the first classifier; samples that it labels Normal are passed to the second classifier, so that the model as a whole distinguishes five classes: DoS, Probe, Normal, R2L, and U2R.

The experiments first compare the classification results of the cascade classifier on feature subsets chosen by various feature selection methods and by CGFR. They then compare the results of the cascade classifier on the original dataset, on datasets resampled by SMOTE at different sampling rates, and on the dataset resampled by SMOTE-NCL. Finally, they compare SVM, KNN, NB, C4.5, and the cascade classifier on the SMOTE-NCL-resampled dataset. For both U2R and R2L attacks, the proposed cascade classifier intrusion detection model based on CGFR and SMOTE-NCL achieves higher AUC values than every other configuration. Detection of R2L nevertheless remains unsatisfactory, because R2L attacks are hard to identify from packet header features and can only be determined from detailed packet content features: the header features of many R2L samples are indistinguishable from those of Normal traffic. To address this, the author proposes extracting some features from packet contents during data extraction and dynamically regenerating the training and test sets, which is left as future work.
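
The two-stage routing described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the thesis's implementation: the names `cascade_predict`, `toy_stage1`, and `toy_stage2`, the three-value feature tuple, and every threshold are hypothetical stand-ins for the trained C4.5 and NB classifiers, which the abstract does not specify in detail.

```python
from typing import Callable, Sequence

Sample = Sequence[float]
Classifier = Callable[[Sample], str]


def cascade_predict(sample: Sample, stage1: Classifier, stage2: Classifier) -> str:
    """Route one sample through a two-stage cascade.

    stage1 is assumed to output DoS, Probe, or Normal.
    stage2 is assumed to output Normal, R2L, or U2R.
    """
    label = stage1(sample)
    if label != "Normal":
        # DoS or Probe: the first stage's decision is final.
        return label
    # Only samples labelled Normal by stage 1 reach stage 2.
    return stage2(sample)


# Hypothetical rule-based stand-ins for the trained classifiers.
# Feature tuple (assumed): (connections_per_second, failed_logins, root_shell_flag)
def toy_stage1(sample: Sample) -> str:
    conn_rate = sample[0]
    if conn_rate > 100:
        return "DoS"
    if conn_rate > 10:
        return "Probe"
    return "Normal"


def toy_stage2(sample: Sample) -> str:
    failed_logins, root_shell = sample[1], sample[2]
    if root_shell >= 1:
        return "U2R"
    if failed_logins >= 3:
        return "R2L"
    return "Normal"


if __name__ == "__main__":
    for s in [(500, 0, 0), (20, 0, 0), (1, 5, 0), (1, 0, 1), (1, 0, 0)]:
        print(s, "->", cascade_predict(s, toy_stage1, toy_stage2))
```

One consequence of this design, consistent with the abstract's description, is that the second classifier never sees samples the first stage labels DoS or Probe, so each stage faces a three-class problem rather than the full five-class one.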
[Degree-granting institution]: Southwest University (西南大学)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.08

[References]