Research and Implementation of a Chinese Anti-Spam System Based on an Improved Winnow Algorithm

Published: 2018-07-15 07:05

[Abstract]: With the widespread use of the Internet, email has become an important channel for everyday online communication. Spam, however, carries commercial advertising, virus programs, and sensitive content; it threatens system security and inconveniences users. Spam filtering has therefore become a problem of global practical significance. This thesis studies content-based spam filtering in depth and, taking the characteristics of Chinese spam into account, designs and implements a Chinese anti-spam filtering engine based on automatic classification. The engine consists of four parts: preprocessing, training, classification, and feedback. For preprocessing, the thesis examines the submodules for mail decoding, Chinese word segmentation, feature extraction, and vector representation of messages. Word segmentation uses ICTCLAS, the Chinese lexical analysis system from the Chinese Academy of Sciences, and feature extraction uses mutual information. Training and classification are the focus of the work. First, the exponential and factor forms of the basic Winnow algorithm are unified, and from this the exponential form of the Balanced Winnow algorithm is derived. Second, to address the jitter exhibited by basic Winnow, an improved Winnow-based spam filtering algorithm, Review Winnow, is proposed; it effectively reduces jitter, and its loss function describes the intrinsic cost of misclassified messages more faithfully. Third, removing outliers from the mail sample set and applying a modified Boosting algorithm improves the performance of the Winnow classifier, and the ADOR-Winnow mail classifier is built on this basis. Finally, experiments show that the Balanced R-Winnow algorithm effectively reduces jitter and that the ADOR-Winnow mail classifier greatly improves classification performance. For feedback, the thesis proposes a grid-based feedback learning model. By grouping users, the model extends the usual two feedback levels to three: system level, domain level, and user level. This change facilitates collaborative filtering between groups and centralized feedback learning, and it also helps improve the filtering performance of the mail classifier.
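For readers unfamiliar with the algorithm family the abstract builds on, the following is a minimal sketch of the textbook Balanced Winnow update in its factor (multiplicative) form. It is not the thesis's Review Winnow or ADOR-Winnow, whose details the abstract does not give. The class name, the parameter values (`alpha`, `beta`, `theta`), the initial weights, and the toy data are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


class BalancedWinnow:
    """Textbook Balanced Winnow over binary feature vectors (factor form).

    Illustrative only: parameter defaults and initial weights are assumptions,
    not values taken from the thesis.
    """

    def __init__(self, n_features: int, alpha: float = 1.5,
                 beta: float = 0.5, theta: float = 1.0) -> None:
        self.alpha = alpha    # promotion factor, > 1
        self.beta = beta      # demotion factor, between 0 and 1
        self.theta = theta    # decision threshold
        self.w_pos = [2.0] * n_features   # positive weight vector
        self.w_neg = [1.0] * n_features   # negative weight vector

    def score(self, x: Sequence[int]) -> float:
        # The effective weight of each feature is w_pos - w_neg.
        return sum((p - n) * xi for p, n, xi in zip(self.w_pos, self.w_neg, x))

    def predict(self, x: Sequence[int]) -> bool:
        return self.score(x) > self.theta

    def update(self, x: Sequence[int], is_spam: bool) -> bool:
        """Mistake-driven update; returns True if the sample was misclassified."""
        if self.predict(x) == is_spam:
            return False
        # Missed spam: promote w_pos and demote w_neg on active features.
        # False alarm on legitimate mail: do the opposite.
        up, down = (self.alpha, self.beta) if is_spam else (self.beta, self.alpha)
        for i, xi in enumerate(x):
            if xi:
                self.w_pos[i] *= up
                self.w_neg[i] *= down
        return True

    def fit(self, samples: List[Tuple[Sequence[int], bool]], epochs: int = 10) -> int:
        """Repeat passes until one has no mistakes; returns mistakes in the last pass."""
        mistakes = 0
        for _ in range(epochs):
            mistakes = sum(self.update(x, y) for x, y in samples)
            if mistakes == 0:
                break
        return mistakes


# Made-up toy data: feature 0 is present only in spam.
samples = [
    ([1, 0, 1], True),
    ([1, 1, 0], True),
    ([0, 1, 1], False),
    ([0, 0, 1], False),
    ([0, 1, 0], False),
]
clf = BalancedWinnow(n_features=3)
remaining_mistakes = clf.fit(samples)
```

Because weights change only on misclassified samples and only by multiplication, a single noisy sample can swing the decision boundary back and forth. This is presumably the kind of jitter that the thesis's modifications target, though the abstract does not spell out the mechanism.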
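[Degree-granting institution]: Nanjing University of Aeronautics and Astronautics
[Degree level]: Master's
[Year degree conferred]: 2008
[Classification number]: TP393.098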
