Research on Spam Filtering Based on Text Classification Techniques

Published: 2018-04-28 18:08

Topic of this thesis: spam + mutual information; Reference: Anhui University, 2017 master's thesis

[Abstract]: With the development of Internet advertising technology and the spread of e-mail, spam advertising has become an increasingly serious problem, and how accurately spam is filtered directly affects the user experience. Building on previous theory and research, this thesis systematically studies spam classification methods, focusing on the application of naive Bayes classification to spam filtering. It first gives an overview of spam in terms of its definition, characteristics, and harm, reviews the state of spam research in China and abroad, and introduces two families of spam filtering methods: those based on the mail source and those based on content. Among them, the content-statistics-based naive Bayes classifier is widely used in spam filtering research because it is efficient, inexpensive, and easy to implement. The thesis then introduces the key techniques of text classification: text preprocessing, feature selection, text representation, and classification algorithms. Finally, experiments show that the improvements proposed here over traditional naive Bayes classification raise classification performance. Given the importance of classifying mail accurately and the need for authentic, authoritative data, five groups of comparative experiments were designed using data from the Apache SpamAssassin Project. Experiment one builds a Bernoulli naive Bayes model directly on the unprocessed data. Because the dictionary contains a large number of words, the joint probability computation is large and exceeds the computer's available computing capacity: while computing the probability that a text belongs to a given class, the value easily falls outside the floating-point range and the result becomes zero, which hurts classification accuracy.
The thesis therefore optimizes the computation by instead calculating the ratio of the probability that a text is judged to be normal mail to the probability that it is judged to be spam, which raises classification accuracy from 88.3% to 92.3%. Although this ratio makes the fullest use of the floating-point range, it can still evaluate to zero or to infinity, so the dimensionality of the text features has to be reduced. Experiment two first removes stop words in the traditional way; accuracy actually decreases, which shows that some stop words do contribute to text classification, and so the work turns to feature extraction. Experiment three improves on the mutual information feature extraction method by introducing the concepts of "relative dependency", "classification ability", and "comprehensive classification ability" and proposing an alternative classification criterion. Compared with the mutual information method, with about 10,000 feature words selected in both cases, accuracy rises from 87.8% to 96.6%. The improved method extracts the feature set with the greatest comprehensive classification ability, but for a given test message that set does not necessarily have the greatest classification ability. Building on this, experiment four further revises the feature-selection computation, in a method called adaptive feature selection, and classification accuracy improves across the board. With a feature set of suitable dimensionality, experiment five relaxes the strict naive Bayes assumption that attributes are mutually independent: a hidden parent node is created for each attribute to describe its dependence on the other attributes, a model called single-hidden naive Bayes, and a minor refinement to the model's computation is also proposed. The experimental results show that classification accuracy improves. To make the reported accuracy more reliable, all experiments use ten-fold cross-validation.
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.098;TP391.1

[Related literature]