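Research on Content-Based Spam Filtering Technology
Published: 2018-03-20 16:31
Topic: spam; Focus: email filtering; Source: Southwest Jiaotong University, 2009 master's thesis; Document type: degree thesis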
[Abstract]: With the rapid development of computer networks, email has become an indispensable means of communication in daily life. At the same time, large volumes of spam have emerged. This spam carries reactionary, fraudulent, promotional, and illegal-sales content; it seriously disrupts normal communication and also poses a potential danger to society. Recent surveys show that text remains the main form in which spam is spread, so filtering based on message content has long been the main research direction in anti-spam work.

Content-based spam filtering consists of four main stages: word segmentation, text representation, feature selection, and classification. Researchers have done extensive work on all four and achieved many results. This thesis analyzes the principles of these four stages, concentrates on feature selection algorithms, and improves the mutual information feature selection algorithm according to the characteristics of spam filtering.

The thesis first briefly reviews the development, applications, and current state of content-based spam filtering and describes the algorithmic principles of each stage. For word segmentation, based on an analysis of spam content, it adds a segmentation preprocessing step to the traditional segmentation algorithm and presents the resulting algorithm flow. For feature selection, it focuses on the application of the mutual information algorithm to spam filtering, analyzes and improves the traditional algorithm with respect to three aspects (frequency, dispersion, and concentration), adds a term frequency factor to it, uses a category contribution ratio to measure how differently a feature contributes to each category, and runs simulation experiments in MATLAB on a real email corpus. For text classification, it analyzes the application of Bayesian classification to spam filtering and runs email classification experiments with the naive Bayes classifier in the Weka environment.

The experiments with the improved algorithm are compared with the traditional mutual information algorithm and with experiments reported in other literature. The results show that, at similar dimensionality compression ratios, the improved mutual information algorithm significantly raises the precision and recall of spam detection and provides a better basis for the subsequent classification stage.
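The abstract names mutual information feature selection and a term frequency factor but does not give the formulas. The sketch below is therefore only an illustration under stated assumptions. It computes the textbook document-frequency form of mutual information, MI(t, c) = log(A·N / (df(t)·N_c)), and then applies a hypothetical log term-frequency weight. The function names, the toy corpus, and the weighting itself are assumptions made for this example and are not the thesis's improved algorithm.

```python
import math
from collections import Counter


def mi_scores(docs, labels, target="spam"):
    """Textbook mutual information between each term and the target class.

    MI(t, c) = log(A * N / (df(t) * N_c)), where A is the number of
    target-class documents containing t, df(t) the number of documents
    containing t, N the corpus size and N_c the target-class size.
    """
    n_docs = len(docs)
    n_class = sum(1 for y in labels if y == target)
    df_all = Counter()    # documents containing the term
    df_class = Counter()  # target-class documents containing the term
    tf_class = Counter()  # total occurrences inside the target class
    for tokens, y in zip(docs, labels):
        df_all.update(set(tokens))
        if y == target:
            df_class.update(set(tokens))
            tf_class.update(tokens)
    scores = {
        t: math.log(a * n_docs / (df_all[t] * n_class))
        for t, a in df_class.items()
    }
    return scores, tf_class


def tf_weighted_scores(scores, tf_class):
    """Hypothetical frequency weighting (an assumption, not the thesis's formula)."""
    return {t: s * math.log(1 + tf_class[t]) for t, s in scores.items()}


# Toy corpus, invented for illustration; documents are already segmented into tokens.
docs = [
    ["free", "money", "now"],
    ["free", "offer", "free"],
    ["meeting", "now"],
    ["project", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

scores, tf_class = mi_scores(docs, labels)
weighted = tf_weighted_scores(scores, tf_class)
top_terms = sorted(weighted, key=weighted.get, reverse=True)[:2]
print(top_terms)
```

In this toy corpus, plain mutual information gives "free" (three occurrences across both spam messages) the same score as "money" (one occurrence in one message). This is the familiar bias of mutual information toward rare terms. Any frequency-aware weight, such as the assumed one here, breaks the tie in favor of the more frequent term.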
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2009
[Classification number]: TP393.098