Research on Malicious Web Page Detection Technology Based on the Active SVM Algorithm

Published: 2018-11-11 18:03

[Abstract]: In the Internet era, new applications built on scripting languages and browser plug-in technology keep emerging. Alongside the convenience and speed these applications bring, however, deliberate attacks such as information leakage, information theft, data tampering, unauthorized deletion or insertion of data, and computer viruses are becoming increasingly rampant. Web-based attacks are the main kind of attack that Internet users face. Attackers carefully craft attack code that exploits vulnerabilities in browsers or third-party plug-ins to achieve their goals. Malicious code authors produce large amounts of malicious code and apply a variety of obfuscation techniques to obfuscate and transform malicious scripts so that they evade malicious code detection, which relies mainly on signature matching; obfuscated JavaScript is the most prominent case. These obfuscation methods generate large numbers of malicious code variants, which exploit the immediacy and speed of the Internet to spread indiscriminately and threaten users' information security. This severely hampers malicious code detection and makes obfuscated scripts the hardest point to defend among all web malicious code. How to keep such attacks out of our computers and protect users' information is an urgent problem for society today, and one on which network security experts are persistently seeking a breakthrough. This thesis studies JavaScript obfuscation techniques and proposes feature extraction based on the TF-IDF algorithm, which brings the term weighting used in text classification into the analysis so that features are extracted from JavaScript scripts in a more principled way. Experiments show that TF-IDF-based feature extraction performs substantially better than traditional feature extraction methods.
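To make the TF-IDF idea concrete, the following is a minimal sketch rather than the thesis's actual feature extractor. It assumes the common variant in which term frequency is the relative count of a token within one script and inverse document frequency is log(N / df). The regular-expression tokenizer, the function names `tokenize` and `tfidf`, and the three sample scripts are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Assumption: identifiers and keywords stand in for "terms"; a real system
# would more likely use a proper JavaScript lexer.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def tokenize(script):
    """Split a JavaScript snippet into identifier/keyword tokens."""
    return TOKEN_RE.findall(script)

def tfidf(scripts):
    """Return one {token: weight} dict per script, weight = tf * log(N / df)."""
    docs = [Counter(tokenize(s)) for s in scripts]
    n_docs = len(docs)
    # Document frequency: in how many scripts each token appears at least once.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())
    vectors = []
    for counts in docs:
        total = sum(counts.values())
        vectors.append({
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return vectors

# Hypothetical toy corpus: two ordinary scripts and one eval/unescape-heavy one.
scripts = [
    "var a = 1; document.write(a);",
    "eval(unescape(payload)); eval(unescape(other));",
    "var b = 2; document.write(b);",
]
vectors = tfidf(scripts)
```

In this toy corpus, tokens such as `eval` that are frequent in one script but rare across the corpus receive a higher weight than tokens such as `var` that appear in most scripts, which is the property that makes the weighting useful as a classifier feature.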
The thesis also addresses the shortcomings of the traditional supervised SVM by introducing an active learning strategy from machine learning, which reduces manual work, improves efficiency, and makes the system highly automated. Experiments show that the malicious web page detection system based on Active SVM achieves better performance with fewer labeled samples and less human effort.
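The abstract does not say which query strategy the Active SVM uses, so the following sketch assumes one common choice: pool-based active learning in which the unlabeled sample closest to the current SVM hyperplane is sent to the annotator. The linear SVM is trained here by plain subgradient descent on the L2-regularised hinge loss. The function names, hyperparameters, and the 2-D toy feature vectors are hypothetical, and the `oracle` callback stands in for a human labeler.

```python
def decision(w, b, x):
    """Signed score w.x + b; a positive score means 'malicious'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.001, eta=0.1, epochs=3000):
    """Fit a linear SVM by subgradient descent on the regularised hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * decision(w, b, xi) < 1.0
            # Shrink step: gradient of the L2 penalty on w.
            w = [wj * (1.0 - eta * lam) for wj in w]
            if violated:
                # Hinge term is active: move the hyperplane toward this sample's side.
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def active_learn(pool, oracle, seed_ids, budget):
    """Query up to `budget` labels, each time for the sample nearest the hyperplane."""
    labeled = {i: oracle(i) for i in seed_ids}
    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        ids = sorted(labeled)
        w, b = train_linear_svm([pool[i] for i in ids], [labeled[i] for i in ids])
        query = min(unlabeled, key=lambda i: abs(decision(w, b, pool[i])))
        labeled[query] = oracle(query)  # the only place a human label is requested
    ids = sorted(labeled)
    w, b = train_linear_svm([pool[i] for i in ids], [labeled[i] for i in ids])
    return w, b, labeled

# Hypothetical 2-D features (for example, weights of two suspicious tokens).
pool = [(0.05, 0.10), (0.10, 0.05), (0.15, 0.10), (0.40, 0.35),
        (0.60, 0.65), (0.90, 0.95), (0.95, 0.90), (0.85, 0.90)]
truth = [-1, -1, -1, -1, 1, 1, 1, 1]  # -1 = benign, +1 = malicious

w, b, labeled = active_learn(pool, lambda i: truth[i], seed_ids=[0, 5], budget=2)
predictions = [1 if decision(w, b, x) > 0 else -1 for x in pool]
```

The sketch starts from one labeled sample per class and requests only two further labels, so the classifier is trained on half of the pool. Samples that lie between the two clusters are the ones most likely to be queried, since they are the least certain under the current hyperplane; this is the intuition behind reaching good performance with fewer labeled samples.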
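[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.08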

[References]