Research and Implementation of Heuristic-Based Phishing Website Detection Technology

Published: 2019-01-13 20:22

[Abstract]: Phishing is a form of network attack in which a web page carries deceptive malicious content that lures Internet users into submitting personal information, allowing the attacker to steal private data and even personal property. To improve the accuracy of phishing website detection and reduce dependence on third-party tools and resources, this thesis studies heuristic phishing website detection and phishing page topic identification. First, it examines the key techniques of web content preprocessing. For web data collection and storage, it proposes an update-based storage strategy that periodically collects information on phishing websites published by third-party platforms. For text feature acquisition, it uses m-TextRank, a keyword extraction algorithm tailored to web page text, to extract and store text features. Second, to improve the precision and stability of phishing detection, the thesis optimizes the detection scheme by promptly identifying new features and accurately selecting the best feature subset, and proposes a multi-layer heuristic phishing website detection model consisting of a feature extraction layer, a feature selection layer, and a heuristic classification layer. The model uses five feature selection algorithms to preprocess the feature set, and the performance of three decision-tree-based classification algorithms is studied. Experimental results show that feature selection with the information gain algorithm, combined with the random tree classification algorithm, achieves 96% accuracy and 95% recall at low time cost. Third, to study the correlation between a page's topic and its legitimacy, as well as the topic distribution of phishing websites, the thesis proposes an LDA-SVM-based phishing page topic identification algorithm. The algorithm builds an LDA-SVM topic classification model through text preprocessing, Gibbs sampling, LDA modeling, SVM classification, and evaluation, and uses that model to identify the topic of a page. Experiments show that topic identification accuracy for phishing websites reaches 93%. The thesis then applies this topic classification model to websites that have passed through heuristic detection, providing corroborating evidence for the heuristic detection results. Finally, building on this research, the thesis designs and implements a heuristic phishing website detection system, which mainly provides web page information collection, legitimacy detection, and page topic identification. System tests show that the system meets the requirement of detecting the legitimacy of unknown websites and overall meets the expected goals.
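The feature selection step mentioned in the abstract can be illustrated with a minimal sketch of information gain ranking. This is not the thesis's implementation: the feature names (`has_ip_in_url`, `long_url`, `uses_https`), the toy data, and the helper names are hypothetical, and the abstract does not say which features or toolkit were actually used. The sketch only shows how information gain scores a binary feature against phishing/legitimate labels so that the most informative features can be kept.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(features, labels):
    """Return (name, gain) pairs sorted from most to least informative."""
    scored = [(name, information_gain(values, labels))
              for name, values in features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
FEATURES = {
    "uses_https":    [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the label here
    "long_url":      [1, 1, 1, 0, 1, 0, 0, 0],  # partially related
    "has_ip_in_url": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly separates classes
}

ranked = rank_features(FEATURES, LABELS)
for name, gain in ranked:
    print(f"{name}: {gain:.3f}")
```

In a pipeline like the one described, the top-ranked features would then be passed to a decision-tree-based classifier; the classifier itself is left out of this sketch.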
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.08
