Active Intelligent Detection of Fake Web Pages Based on a Web Crawler

Published: 2018-02-10 12:50

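Keywords: fake web page detection; active detection; web page feature extraction; deep learning algorithms; machine learning algorithms. Source: North China Electric Power University, master's thesis, 2015. Document type: degree thesis.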

[Abstract]: Phishing is an attack that sends users deceptive spam purporting to come from businesses or financial institutions in order to lure them into disclosing private personal information. The most common approach is to lure users to a fake web page that closely resembles the legitimate target page and to steal the private information that victims save on that page. As the harm caused by fake web pages has grown in recent years, fake web page detection has attracted wide attention as an anti-phishing technique. This thesis proposes an active intelligent detection system for fake web pages based on a web crawler: after obtaining web pages similar to the target website, it extracts their features, reduces the dimensionality of the feature vectors with an Autoencoder as a preprocessing step, and finally uses a BVM classifier to identify fake web pages. First, because passive detection lags, the thesis adopts an active detection mode: edit distance is used to find web pages whose URLs are similar between the seed sites and the target site. Second, features are extracted from each of these similar pages. Because detection results depend heavily on which website features are extracted, the thesis extracts both the document features and the topological features of web pages fairly comprehensively and enriches the types of feature elements. Based on an analysis of page text features and source code, it proposes a more accurate and comprehensive feature vector for fake web pages, then applies an Autoencoder to reduce its dimensionality so that the processed feature vector better meets the classifier's requirements and detection accuracy improves.
Third, the thesis builds an active intelligent detection classifier for fake web pages with the machine learning algorithm BVM, presents the steps and experimental results of BVM-based detection, and analyzes the algorithm's strengths and weaknesses. Extensive experiments show that the proposed BVM-based active intelligent detection method achieves high accuracy with low time cost. Finally, the thesis implements a web-crawler-based active intelligent detection system for fake web pages using Java Web technology. The system adopts a B/S (browser/server) architecture, and the thesis presents its architecture design and functional interfaces.
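The active detection step described in the abstract, using edit distance to find URLs similar to a target site, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `edit_distance`, `host_of`, and `similar_candidates`, the `max_distance` threshold, and the choice to compare host names only are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def host_of(url: str) -> str:
    """Lower-cased host name of a URL, or "" if it has none."""
    return (urlparse(url).hostname or "").lower()


def similar_candidates(target_url, crawled_urls, max_distance=2):
    """Return (url, distance) pairs whose host is close to the target host.

    Assumption: a distance of 0 means the page is on the target host itself,
    so it is skipped; max_distance is a hypothetical threshold.
    """
    target = host_of(target_url)
    hits = []
    for url in crawled_urls:
        d = edit_distance(host_of(url), target)
        if 0 < d <= max_distance:
            hits.append((url, d))
    return sorted(hits, key=lambda pair: pair[1])


if __name__ == "__main__":
    crawled = [
        "https://www.paypa1.com/login",
        "https://www.example.org",
        "https://www.paypal.com/home",
    ]
    for url, d in similar_candidates("https://www.paypal.com", crawled):
        print(d, url)
```

In a pipeline like the one the abstract outlines, the URLs returned here would be the candidates passed on to feature extraction and classification. Comparing host names rather than full URLs is only one possible design choice.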
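[Degree-granting institution]: North China Electric Power University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.092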

[References]