Design and Implementation of a Feedback-Learning Spam Filtering System Based on an Improved K-Nearest-Neighbor Model

Published: 2018-11-13 07:04

[Abstract]: E-mail has become a fast, economical means of modern communication, and almost every Internet user has a mailbox. At the same time, e-mail has increasingly become a major carrier of commercial advertising, viruses, and Trojans. The flood of spam causes considerable harm and inconvenience in daily life, undermines network security, and consumes valuable bandwidth and large amounts of mail-server storage. Many spam filtering methods already exist, yet the volume of spam continues to rise rather than fall, which indicates that existing methods have not achieved satisfactory results. Research on new, efficient mail filtering systems therefore remains of real practical significance. Existing algorithms in spam filtering research are either rule-based or content-based. Rule-based filtering requires users to customize and maintain rules over the long term; in essence it still makes a rigid binary judgment, is confined to processing in a two-dimensional space, and offers no degree of confidence. Most content-based filtering algorithms build on the vector space model, the most widely used being the naive Bayes algorithm and the K-nearest-neighbor (KNN) algorithm. A naive Bayes mail filter is simple to compute, but its recall and precision are difficult to improve further. The KNN algorithm has a computational complexity too high for large-scale or real-time applications. To address this, the thesis introduces the probabilities of a message's legitimate and illegitimate attributes and proposes a new classification algorithm based on them, the SEAFS algorithm. SEAFS combines the strengths of the KNN and naive Bayes models while overcoming their weaknesses, and turns the linear filtering of ordinary spam filtering methods into nonlinear filtering. It improves filtering accuracy while achieving satisfactory efficiency, which makes it suitable for large-scale, real-time online filtering of mail content. Because the content of e-mail changes over time and users' individual needs also keep changing, a feedback learning process is added to capture these changes. The thesis designs and implements a practical, efficient spam filtering system; extensive experiments yield good filtering results and demonstrate the feasibility and effectiveness of the SEAFS algorithm for spam filtering.
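The abstract names naive Bayes as one of the content-based baselines and adds feedback learning so that the filter can follow changing mail content and user preferences. As a rough illustration only, and not the SEAFS algorithm (whose details are not given in this record), the Python sketch below shows a word-count naive Bayes filter that a user can correct after a misclassification. The class name `FeedbackNaiveBayes`, the labels `spam` and `ham`, whitespace tokenization, Laplace smoothing, and the toy messages are all assumptions made for the example.

```python
import math
from collections import Counter

# "ham" comes first so that a tie defaults to treating mail as legitimate.
LABELS = ("ham", "spam")


class FeedbackNaiveBayes:
    """Multinomial naive Bayes over word counts (illustrative, not SEAFS)."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = {label: 0 for label in LABELS}

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase whitespace tokens are good enough for a demo.
        return text.lower().split()

    def learn(self, text, label):
        """Add one labelled message; also used for user feedback."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def log_score(self, text, label):
        """Log prior plus log likelihood, both with add-one smoothing."""
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in self.tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def classify(self, text):
        return max(LABELS, key=lambda label: self.log_score(text, label))


nb = FeedbackNaiveBayes()
for text, label in [
    ("win cash prize now", "spam"),
    ("cheap prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]:
    nb.learn(text, label)

newsletter = "free newsletter click now"
before = nb.classify(newsletter)  # flagged as spam on this toy data
nb.learn(newsletter, "ham")       # user feedback: the message was legitimate
after = nb.classify(newsletter)   # the corrected model now accepts it
print(before, after)
```

Because the model stores only counts, a correction costs one `learn` call rather than a full retraining pass, which is the property that makes feedback learning cheap for this kind of baseline.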
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree conferred]: 2010
[Classification number]: TP393.098

[References]