Research on Hadoop-Based Sensitive Word Detection Technology in Network Security Auditing

Published: 2018-05-07 05:42

Topic: content audit + XML; Source: Donghua University, 2015 master's thesis

[Abstract]: With the spread of the Internet, the information resources available online have grown ever richer. At the same time, more and more illegal, harmful, and sensitive information floods the network, which has become a main medium for spreading feudal superstition, pornography and violence, reactionary speech, and rumors. Against these threats to network security, security auditing offers good protection thanks to its real-time, dynamic, and active-defense characteristics. Drawing on a real network security audit system project at a company, this thesis focuses on sensitive word detection in content auditing. It first introduces the concepts and research status of sensitive word detection and network security auditing, along with the technologies relevant to the topic. Based on an analysis of the system's functional requirements, it then presents the overall implementation model of the system. The project's log data is stored in XML format and carries both semantic and structural information. Combining the double-array Trie with Dewey encoding, the thesis studies sensitive word detection in XML documents, proposes the concept of sensitivity, and gives a method for computing it. Finally, it designs and implements a prototype sensitive word detection system that verifies the effectiveness of the methods and techniques studied. The main work of the thesis is as follows. The functional requirements of the network information security audit system are analyzed and the overall implementation model is designed. Within content auditing, the log-based audit workflow is analyzed, the log data format is given, and the object of the sensitive word detection research is identified. The data to be checked for sensitive words is log data in XML format.
To obtain structural information and detect sensitive words with complex structure, the thesis studies a Dewey-encoding scheme for XML documents, in which the code of a parent node in the XML document tree is used directly as the prefix of each child node's code. The level of a node and the structural relationships between nodes can then be read off easily, which simplifies computing the structural sensitivity of a log. To make detection efficient, the sensitive word lexicon needs an index. The thesis builds this index with a double-array Trie and studies detection algorithms based on semantics and on combined semantic and structural information. On the one hand, semantic sensitivity is computed from node weights and the frequency with which sensitive words occur, and a formula for sensitivity is given. On the other hand, when sensitive words carry structural information, detection must combine semantic and structural information: by computing the distance between sensitive words, semantic matching is performed first and structural similarity matching second, which realizes detection of sensitive words that include structural information. Building on these techniques, the thesis designs and implements a prototype sensitive word detection system for network security auditing. The system is divided into four subsystems: user interface, information preparation, detection engine, and audit policy. The overall architecture is designed and the interaction between user and system is analyzed; on that basis, the design and implementation of each subsystem are described in detail.
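The prefix property of Dewey encoding described above can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: numbering the root "1" and children from 1, the helper names (`dewey_encode`, `level`, `is_ancestor`, `tree_distance`), and the sample log record are all assumptions made for illustration, and the edge-count distance shown is only one plausible way to compare node positions.

```python
import xml.etree.ElementTree as ET


def dewey_encode(root):
    """Return (dewey_code, element) pairs in document order.

    Assumed convention: the root gets code "1", and the k-th child of a
    node with code c gets c + "." + str(k), so a parent's code is always
    a prefix of its children's codes.
    """
    result = []

    def visit(node, code):
        result.append((code, node))
        for k, child in enumerate(node, start=1):
            visit(child, f"{code}.{k}")

    visit(root, "1")
    return result


def level(code):
    """Depth of a node: the number of components in its Dewey code."""
    return len(code.split("."))


def is_ancestor(a, b):
    """True if the node coded `a` is a proper ancestor of the node coded `b`."""
    return b.startswith(a + ".")


def tree_distance(a, b):
    """Number of tree edges between two nodes, via their common code prefix."""
    pa, pb = a.split("."), b.split(".")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


# Hypothetical log record; the real log schema is not given in the abstract.
SAMPLE = "<log><user>alice</user><msg><title>hi</title><body>text</body></msg></log>"
codes = {el.tag: code for code, el in dewey_encode(ET.fromstring(SAMPLE))}
```

Because the level and the ancestor test need only the code strings, structural comparisons of this kind can be made without walking the XML tree again.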
The Dewey code generation algorithm and the detection algorithm based on the double-array Trie index are decomposed appropriately and run on an experimental Hadoop cluster, which improves the scalability of the system to some extent.
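The abstract does not say how the detection work is split across a Hadoop job, so the following Python sketch shows only one plausible decomposition, simulated in a single process. Everything in it is an assumption: the lexicon, the function names, the per-record match count used in place of the thesis's sensitivity formula, and the nested-dict trie, which is a simple stand-in for a double-array Trie rather than an implementation of one.

```python
from collections import defaultdict

# Hypothetical lexicon; the real sensitive word list is not in the abstract.
LEXICON = ["rumor", "scam"]

END = ""  # end-of-word key; never collides with a single-character key


def build_trie(words):
    """Plain nested-dict trie (a stand-in for a double-array Trie)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def find_words(text, trie):
    """Yield every lexicon word found in text, trying each start position."""
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break
            if END in node:
                yield node[END]


def map_record(record_id, text, trie):
    """Map step: emit (record_id, word) for each match in one log record."""
    for word in find_words(text, trie):
        yield record_id, word


def reduce_record(record_id, words):
    """Reduce step: collapse one record's matches into a match count."""
    return record_id, len(words)


def run(records):
    """Run map, group by key (standing in for the shuffle), then reduce."""
    trie = build_trie(LEXICON)
    grouped = defaultdict(list)
    for record_id, text in records:
        for key, word in map_record(record_id, text, trie):
            grouped[key].append(word)
    return dict(reduce_record(k, v) for k, v in grouped.items())


counts = run([
    ("r1", "a rumor about a scam, another rumor"),
    ("r2", "nothing here"),
])
```

Each record is matched independently in the map step, which is what would allow such work to be spread across cluster nodes; records with no matches produce no output.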
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP393.08

[References]