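Research on Static Detection of Malicious PDF Documents Based on an Improved N-gram Method
Published: 2018-04-28 13:42
Topic keywords: PDF documents + Java; Source: East China University of Technology, master's thesis, 2017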
[Abstract]: With the development of information technology and the spread of office automation, PDF has become an indispensable, first-choice document format for work and study. Despite the convenience PDF documents bring, many security problems have gradually emerged in their use. Attackers exploit vulnerabilities in the PDF document format to embed malicious JavaScript code, steal private information from specific targets, and cause those targets incalculable losses. Detecting and defending against PDF documents that carry embedded malicious JavaScript has therefore become an important research goal in information security, both in China and abroad. This thesis analyzes PDF documents, covering their physical and logical structure, the attack techniques that target them, and the ways malicious PDF documents spread. An in-depth analysis of existing N-gram-based static detection models for malicious PDF documents reveals two shortcomings. First, they ignore the effect that hidden information in a PDF document has on the completeness of the extracted JavaScript code, and they preprocess the extracted JavaScript insufficiently. Second, the N-gram feature extraction method can extract only fixed-length N-gram features, so effective features are split apart. To address these problems, the thesis proposes an improved N-gram static detection model for malicious PDF documents. It designs a PDF preprocessing pipeline comprising decryption, decoding, JavaScript location and extraction, and JavaScript deobfuscation, so that the extracted JavaScript code is complete and valid, and it improves on the existing N-gram feature extraction method so that more effective N-gram feature vectors are extracted. To verify the improved method, features are extracted with the N-gram method both before and after the improvement, the resulting feature vectors are used as input data, and several detection algorithms are trained and tested on them; the same detection algorithms are also combined with the Boosting algorithm and then trained and tested. The results show that the improved N-gram feature extraction method is effective for detecting malicious PDF documents and achieves better detection results than the method before improvement, that combining the detection algorithms with Boosting improves the detection model's performance, and that the model outperforms the DPScan and PJScan models.
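The second shortcoming named in the abstract, that fixed-length N-gram extraction splits longer meaningful patterns apart, can be illustrated with a minimal sketch. It assumes character-level windows over JavaScript that has already been extracted; the abstract does not say whether the thesis works on characters, bytes, or tokens, and the function name `char_ngrams` and the sample string are hypothetical. The sketch shows only the baseline fixed-length method, not the improved method proposed in the thesis.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every contiguous window of exactly n characters in text."""
    # A text shorter than n yields an empty range, hence an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical snippet standing in for JavaScript extracted from a PDF.
js = "var s = unescape('%u9090%u9090'); eval(s);"

# Fixed window length of 4 characters (an illustrative choice).
grams = char_ngrams(js, 4)

# The 8-character name "unescape" never appears as a single feature;
# it is only visible as overlapping 4-character fragments.
fragments = [g for g in ("unes", "nesc", "esca", "scap", "cape") if g in grams]
print(fragments)
print(grams["%u90"])  # how often the 4-character window "%u90" occurs
```

With a window of 4, a longer indicator such as `unescape` is represented only by its fragments, which is the "effective features are split apart" problem the abstract describes.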
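[Degree-granting institution]: East China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP309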
Article ID: 1815508
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1815508.html