Research on the Application of an Improved KNN Algorithm to Spam Filtering

Published: 2018-02-09 05:53

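Keywords: spam; KNN algorithm; partial-dependence property; class center vector. Source: Hunan University, 2010 master's thesis. Document type: degree thesis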

[Abstract]: With the spread of the Internet, email has become one of the most convenient and economical means of communication in daily life. That convenience comes with an unavoidable by-product: spam. Because sending email is simple and cheap, and because there is money to be made, some companies and individuals use it for commercial advertising, and some attackers use it for illegal purposes such as stealing users' confidential data or attacking their computers. Email users receive dozens or even hundreds of spam messages almost every day and must spend time and effort identifying and deleting them. Spam also burdens network service providers and network administrators. It consumes bandwidth, time, and storage, and in severe cases it congests the network so that legitimate mail cannot be sent or received, which seriously hinders the healthy development of the Internet. Research on spam filtering therefore has considerable practical value and addresses an urgent problem.

This thesis surveys the main characteristics of current spam and the state of spam filtering technology, and discusses the underlying theory, strengths, and weaknesses of the various anti-spam techniques. It then studies the KNN algorithm, which currently performs well, and identifies three shortcomings. First, traditional KNN decides the class using only the sum of similarities, or simply the number of similar samples. Second, when applied to spam filtering it ignores the partial-dependence property of the task: users would rather receive one more spam message than have the filter misclassify a legitimate message as spam and discard it. Third, classification requires comparing the test sample with every sample in the training set and computing a similarity each time, so the computational cost is very high and ill-suited to spam filters with strict real-time requirements.

To address these shortcomings, the thesis proposes and designs a KNN algorithm based on average similarity and the number of similar samples that takes the partial-dependence property into account. The algorithm first uses the average similarity, rather than the sum of similarities, as the class weight, while also considering the effect of the number of similar samples on classification performance. It then introduces two parameters that express the partial-dependence property of spam filtering. Finally, to reduce the computational cost of KNN, it borrows the idea of the class center vector method: the original samples are grouped into many small classes, and the center vector of each small class stands in for the original training samples when building the classification model. This effectively turns a large sample set into a small one, reduces the number of comparisons, and greatly lowers the computational cost of KNN classification. Experiments show that, compared with the traditional KNN algorithm, the proposed APC-KNN algorithm offers high accuracy and a low false-alarm rate when applied to spam filtering, filters spam more effectively, and helps protect email users and save bandwidth.
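The abstract names the three ingredients of the method but gives no formulas, so the following is only a minimal sketch of how they might fit together, not the thesis's APC-KNN implementation. Everything specific here is an assumption made for illustration: the function names, the cosine similarity measure, the way small classes are formed (consecutive same-label samples), the scoring rule that mixes average similarity with neighbor share, and the parameters `mix`, `ham_bias`, and `spam_bias`.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def class_centers(samples, group_size=2):
    """Compress (vector, label) pairs into centroids of small same-label groups.

    Assumption: a small class is `group_size` consecutive samples of one
    label; the abstract does not say how the small classes are built.
    """
    by_label = defaultdict(list)
    for vec, label in samples:
        by_label[label].append(vec)
    centers = []
    for label, vecs in by_label.items():
        for i in range(0, len(vecs), group_size):
            group = vecs[i:i + group_size]
            centroid = [sum(col) / len(group) for col in zip(*group)]
            centers.append((centroid, label))
    return centers


def classify(query, centers, k=3, mix=0.5, ham_bias=1.2, spam_bias=1.0):
    """Label `query` from its k most similar centroids.

    Hypothetical class score (not taken from the thesis):
        bias * (mix * average_similarity + (1 - mix) * neighbor_share)
    A ham_bias above spam_bias makes the filter reluctant to call mail spam.
    """
    nearest = sorted(
        ((cosine(query, vec), label) for vec, label in centers), reverse=True
    )[:k]
    sims = defaultdict(list)
    for sim, label in nearest:
        sims[label].append(sim)
    bias = {"ham": ham_bias, "spam": spam_bias}
    scores = {}
    for label, values in sims.items():
        average = sum(values) / len(values)   # average, not summed, similarity
        share = len(values) / len(nearest)    # fraction of neighbors in class
        scores[label] = bias.get(label, 1.0) * (mix * average + (1 - mix) * share)
    return max(scores, key=scores.get)


# Toy term-count vectors, invented for this example only.
train = [
    ([3, 0, 1], "spam"), ([4, 1, 0], "spam"), ([5, 0, 0], "spam"), ([3, 1, 1], "spam"),
    ([0, 3, 1], "ham"), ([1, 4, 0], "ham"), ([0, 5, 1], "ham"), ([1, 3, 0], "ham"),
]
centers = class_centers(train, group_size=2)  # 8 samples become 4 centroids
print(classify([4, 0, 1], centers, k=3))
print(classify([1, 1, 0], centers, k=4))
```

In this toy setup the second query is equally similar to both classes, so the two bias parameters alone decide the label. That is the role the abstract assigns to them: when the evidence is balanced, the filter should lean toward delivering the message rather than discarding it.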
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree granted]: 2010
[Classification number]: TP393.098

[References]