面向网络入侵检测的串匹配算法优化

发布时间：2018-04-24 09:27

本文选题：网络安全 + 入侵检测系统　；参考：《哈尔滨工业大学》2014年博士论文

【摘要】：互联网已经成为人们日常生活中不可或缺的一个部分。伴随而来的是日趋猖獗的网络犯罪对网络节点上主机的信息安全的威胁。出于不同的目的，网络使用者可以利用一定的技术或者社会手段，非法进入个人主机、盗取个人账户密码甚至是对网站进行协作攻击。建立一个坚固的预防、阻断网络犯罪的安全体系至关重要。入侵检测系统就是网络安全体系中不可或缺的重要组组成部分，其主要处理对象就是来自网络的数据包。入侵检测系统会对网络数据包和已知的数据库进行比对，根据已掌握的数据将数据包中的信息归类为入侵行为或者安全行为。针对入侵检测系统监测到的入侵行为，还可以将这些入侵告警进行分析进一步挖掘告警之间的逻辑联系，揭示隐含的攻击过程用于定位攻击源或者为后续检测提供知识储备。为了确保能有效地提取网络数据包中的信息需要高效的串匹配算法提供技术支持。然而随着网络技术的发展，各种新的攻击类型也层出不穷，这也使得串匹配自动机需要维护的特征集的规模越来越大，直接降低了入侵检测系统的性能。基于这样的背景，本文的研究目标是构造支持大规模特征集合的串匹配算法，并且能够找到对现有串匹配算法进行速度提升的方法。具体地，本文着重从以下几个方面、不同层次对串匹配的优化方法进行了深入的研究： (1)特征集规模过大导致匹配自动机无法正常构建限制了入侵检测系统能够识别的有害信息的数量。本文特别针对大特征集文本匹配问题，提出一种混合匹配自动机模型。该模型旨在通过长度对特征进行分类，构造更加紧凑的匹配自动机。通过将具有不同长度范围的特征分布到不同的自动机中，可以采用不同的方式对文本进行匹配。本文同时依据混合匹配自动机的结构特点设计了两种独立的文本过滤策略，即逆向状态映射状态转移方法（SLSPM算法）和二级AVL过滤策略（BCHDFA算法）。两种技术分别利用“将‘自动机状态-读入文本信息’的映射关系进行翻转”和“借助AVL树对特征段中特定位置上的双字符文本进行两次过滤”达到减少将文本块与特征块进行验证的次数的目的，从而抑制混合自动机匹配速度降低的趋势。 (2)针对URL等长特征，本文结合SSE指令提出了两种专门面向16字节数据的匹配算法SSEHash和RTRIE，这两种算法旨在提高长特征的匹配速度。SSEHash算法中，通过2个指令周期共16个加法运算和一个24比特模运算能够将16字节文本比较均匀地分布到24比特的数据中。由于采用SSE指令进行计算，使得SSEHash算法和常规意义的散列算法相比需要更短的处理时间，能够更快地过滤文本信息。为了解决大规模特征集环境下Wu-Manber类算法的移动距离较短的问题，本文结合SSE指令提出了一种反转自动机RTRIE。该算法提取每条长特征（长度至少16）的16字节前缀中的所有后缀子串的逆向串的指纹，并为每个指纹计算其安全移动距离。该算法和通常的Wu-Manber类算法相比在计算文本指针移动距离时考虑的文本字符数目更多，这样可以尽量避免由于特征规模增加而导致的指针移动距离的降低。 (3)为了提高表达式匹配性能消除无用表达式对内存的占用，本文分析了表达式的各种包含关系并提出了BitCount算法的优化算法MaskVeri以及表达式冗余消除算法KPGEM。为了不遗漏任何可能的表达式的包含关系，本文借助表达式中出现的各个短关键字的位置以及关键字的长度等信息限定了各关键字的前驱后继关系。借助这种关系可以定义针对每条表达式的关键字路径图。通过对关键字路径图的遍历KPGEM算法可以分析出当前表达式中隐含的其它表达式，并正确地删除表达式中冗余的关键字或者包含其它表达式的表达式。 (4)基于串匹配算法的数据包检测部件是入侵检测系统中的重要功能部件。在此基础上才能对通过串匹配模块获得的网络告警日志进行告警关联分析。本文在最后一部分实现了一个入侵告警事件关联系统。该系统通过对告警中的重复结构进行保持语义的序列最小化处理，使得串匹配算法捕获的告警得到大幅精简。系统中采用D-S证据理论对宏观分析方法（序列挖掘）和微观分析方法（权能提升推理）二者处理的结果进行融合，，并通过信任能量计算公式对经过D-S融合后的告警序列进行再处理，从获得的频繁序列中得到更能体现前后关系的攻击告警序列。该系统能够获得有效的攻击场景，并达到自动提取关键攻击序列的目的。
[Abstract]:The Internet has become an indispensable part of people's daily life. It is accompanied by the threat that the increasingly rampant network crime threatens the information security of the host on the network nodes. For different purposes, users can use certain technical or social means to enter the personal host by non law and steal personal account passwords. The intrusion detection system is an important component part of the network security system, and its main object is the data packet from the network. The intrusion detection system will carry on the network data packet and the known data. According to the data that has been mastered, the data is classified as intrusion or security behavior according to the data already mastered. In view of the intrusion behavior monitored by the intrusion detection system, these Intrusion Alerts can be analyzed to further excavate the logical connection between the alarm and reveal the hidden attack process used to locate the source of the attack or to be after the attack. Continuous detection provides knowledge reserves. In order to ensure effective extraction of information in network packets, efficient string matching algorithms are required. However, with the development of network technology, various new types of attack are emerging, which makes the scale of the feature set that the string matching automata need to maintain is becoming more and more large and is directly reduced. The performance of the intrusion detection system.
Based on this background, the aim of this paper is to construct a series matching algorithm that supports a large set of features, and to find a way to speed up the current string matching algorithm.
(1) the size of the feature set is too large to limit the number of harmful information that the intrusion detection system can recognize. In this paper, a hybrid matching automaton model is proposed for the text matching problem of the great special collection. This model is designed to classify the characteristics by the length and construct a more compact matching automatically. By distributing features of different length range to different automata, the text can be matched in different ways. In this paper, two independent text filtering strategies are designed based on the structure characteristics of mixed matching automata, that is, reverse state mapping state transfer (SLSPM algorithm) and two level AVL filtering strategy. (BCHDFA algorithm). The two techniques use "the mapping relationship of the automata state reading text information" and "the two filtering with the help of the double character text on the specific location in the feature segment" to reduce the number of verifying the text block and the feature block by the AVL tree, thus restraining the mixed automata. The trend of decreasing speed.
(2) in view of the URL isometric feature, this paper presents two matching algorithms, SSEHash and RTRIE, which are specialized for 16 byte data in combination with the SSE instruction. The two algorithms are designed to improve the long feature matching speed.SSEHash algorithm with a total of 16 addition operations and a 24 specific mode operation in the 2 instruction cycles to distribute the 16 byte text more evenly. To 24 bits of data, because of the use of SSE instruction, the SSEHash algorithm needs shorter processing time compared with the conventional hash algorithm and can filter text information faster. In order to solve the problem of the shorter moving distance of the Wu-Manber class algorithm in the large model collection environment, this paper presents a kind of SSE instruction. This algorithm extracts the fingerprint of the reverse string of all the suffixes in the 16 byte prefix of each long feature (at least 16) and calculates its safe moving distance for each fingerprint. This algorithm and the usual Wu-Manber class algorithm consider the number of text characters in the calculation of the moving distance of the text pointer, so that the number of text characters is more. It is possible to avoid the reduction of pointer movement distance due to the increase of feature size.
(3) in order to improve the performance of expression matching to eliminate the use of useless expressions for memory, this paper analyzes the various inclusion relations of the expression and proposes an optimization algorithm MaskVeri of BitCount algorithm and the expression redundancy elimination algorithm KPGEM. in order to avoid missing any possible expression. The position of each short keyword and the length of the keyword limit the forward and subsequent relations of each keyword. By this relationship, the keyword path graph for each expression can be defined. By traversing the keyword path graph, the KPGEM algorithm can be used to analyze the other expressions implying in the current expression and delete it correctly. A redundant keyword in an expression or an expression containing other expressions.
(4) the packet detection unit based on string matching algorithm is an important functional component in the intrusion detection system. On this basis, the network alarm log obtained through the string matching module can be analyzed. In the last part, an intrusion alarm event closing system is implemented. The system is repeated in the alarm. The structure carries out the sequential minimization of the semantics, making the alarm captured by the string matching algorithm greatly streamlined. In the system, the D-S evidence theory is used to fuse the results of the macro analysis method (sequence mining) and the microanalysis method (the weight lifting reasoning), and the D-S fusion formula is used to fuse the D-S through the trust energy calculation formula. The alarm sequence is reprocessed to get the attack alarm sequence which is more able to reflect the front and back relations from the frequent sequences obtained. The system can obtain the effective attack scene and automatically extract the key attack sequence.

【学位授予单位】：哈尔滨工业大学
【学位级别】：博士
【学位授予年份】：2014
【分类号】：TP393.08

【参考文献】