Design and Implementation of a Machine Learning-Based Abnormal Traffic Detection System

Published: 2018-04-09 13:28

Topic: traffic analysis　Focus: anomaly detection　Source: Beijing University of Posts and Telecommunications, master's thesis, 2017

[Abstract]: As Internet technology continues to develop, people's lives and work depend increasingly on Internet applications. However, because security awareness is lacking and attack techniques keep growing more complex and diverse, many network applications suffer a wide range of attacks and security threats, exposing numerous security vulnerabilities. As the first step in attack defense, abnormal traffic detection provides an effective safeguard for intercepting attacks, so detecting abnormal traffic accurately is essential to the availability and security of network applications. Building on a study of existing abnormal traffic detection techniques, this thesis introduces machine learning methods into the anomaly detection system and proposes and designs a machine learning-based abnormal traffic detection model. The model consists of four parts: 1) statistically analyzing the characteristics of abnormal traffic from a data mining perspective to build a malicious keyword library and a multidimensional feature library; 2) testing the validity of the multidimensional feature library and optimizing the feature set; 3) selecting machine learning algorithms to learn from and validate on the training set, and evaluating the performance of the classification results; 4) deploying the system in practice on a Hadoop and Spark cloud platform, where parallel detection improves the efficiency of abnormal traffic detection. In analyzing the characteristics of abnormal traffic, the thesis combines feature-rule-based and statistical-analysis-based methods and treats abnormal traffic detection as a pattern recognition problem: it extracts what abnormal traffic has in common and how it differs from normal traffic, and summarizes these as feature fields for the machine learning algorithms to validate and evaluate. For feature optimization, the thesis proposes three feature extraction algorithms: a Sigmoid-based feature selection algorithm, an information gain-based feature ranking algorithm, and a time feedback-based feature optimization algorithm. Through three steps (filtering, ranking, and performance optimization), the optimal feature subset is mined from the multidimensional feature set. For the choice of machine learning algorithm, the thesis compares and evaluates three classification algorithms (decision tree, random forest, and GBDT), taking parallelization into account; the experiments show that GBDT has the advantage in accuracy and recall. Finally, considering the big data environment the system faces in practice, the thesis designs and implements a distributed detection system that uses the data processing strengths of the Hadoop and Spark distributed platforms and cloud storage to fully parallelize data preprocessing, feature parsing, and the machine learning process, greatly improving the system's detection efficiency.
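The abstract names an information gain-based feature ranking step but does not give its details. As an illustration only, the following minimal Python sketch ranks discrete features by the standard information gain (label entropy minus conditional entropy given the feature). The feature names `has_keyword` and `long_url`, the toy samples, and the function names are all hypothetical and are not taken from the thesis; the thesis's actual algorithm may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels):
    """Return (feature name, gain) pairs sorted from most to least informative."""
    names = list(rows[0].keys())
    gains = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: 1 = abnormal request, 0 = normal request.
rows = [
    {"has_keyword": 1, "long_url": 1},
    {"has_keyword": 1, "long_url": 0},
    {"has_keyword": 0, "long_url": 1},
    {"has_keyword": 0, "long_url": 0},
]
labels = [1, 1, 0, 0]

ranking = rank_features(rows, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy data `has_keyword` separates the two classes perfectly while `long_url` carries no information about the label, so `has_keyword` is ranked first. A real feature library would likely include continuous features, which would need to be discretized before this calculation applies.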
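[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181;TP393.06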

[References]