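Research and Implementation of a Spam Filtering System Based on the Bayesian Algorithm
Published: 2018-05-26 14:48
Topic: Bayesian algorithm + email filter; source: master's thesis, University of Electronic Science and Technology of China (电子科技大学), 2010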
[Abstract]: With the growth of the Internet, email has become an indispensable means of communication in daily life, valued by users for its speed and convenience. Its rise as a primary communication channel is an inevitable trend, but as email has grown more popular, unscrupulous senders have exploited it to distribute spam carrying advertisements or illegal videos and images. Spam wastes both users' time and network bandwidth, and it seriously disrupts users' work, life, and study. Finding a practical and efficient anti-spam technique has therefore become especially important.
Driven by the needs of a laboratory project, this thesis surveys the main anti-spam techniques in China and abroad and designs an anti-spam system. After analyzing and comparing the current mainstream techniques, the system adopts content-based filtering, which offers better filtering efficiency. Among content-based techniques, the Bayesian algorithm is judged to classify markedly better than the other classification algorithms considered, so the thesis designs and implements a spam filter based on the Bayesian algorithm. It also proposes improvements that address the shortcomings of the traditional Bayesian filter: the filter is restructured as a mail filtering system with a two-layer architecture to further raise accuracy, and the problems encountered while implementing incremental learning for the Bayesian filter are resolved.
The main work is as follows:
(1) The latest email filtering techniques are surveyed and their strengths and weaknesses compared, in order to select the most effective technique for implementation.
(2) Common email preprocessing techniques (including mail decoding, Chinese word segmentation, and feature-word extraction) are studied and analyzed, and those suited to this system are selected.
(3) The email preprocessing pipeline is implemented, including mail decoding, Chinese word segmentation, and feature-word extraction.
(4) The spam filter is implemented, including its training and testing processes, and the best value of each parameter is determined through extensive experiments, which improves the accuracy of the whole system.
(5) The strengths and weaknesses of the Bayesian filter are analyzed and improvements are proposed, resolving several problems the filter encounters when implementing incremental learning.
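The training, classification, and incremental-learning steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the class `NaiveBayesFilter`, the function `tokenize`, the default `threshold` of 0.9, and the Laplace smoothing scheme are all assumptions made for illustration, and a simple regular expression stands in for the mail decoding and Chinese word segmentation stages. The two-layer architecture is not covered, since the abstract gives no details of it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed stand-in for mail decoding + Chinese word segmentation:
    # lowercase ASCII words, plus each CJK character as its own token.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


class NaiveBayesFilter:
    """Multinomial naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, threshold=0.9):
        # Hypothetical spam cut-off; a real system would tune it by experiment.
        self.threshold = threshold
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        # Incremental learning: only counts are updated, so newly labelled
        # mail can be added at any time without retraining from scratch.
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total = sum(self.word_counts[label].values())
        # Smoothed class prior, then smoothed per-token likelihoods.
        all_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (all_docs + 2))
        for tok in tokens:
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total + vocab + 1))
        return score

    def spam_probability(self, text):
        tokens = tokenize(text)
        log_spam = self._log_score(tokens, "spam")
        log_ham = self._log_score(tokens, "ham")
        # Normalise in log space to avoid underflow on long messages.
        top = max(log_spam, log_ham)
        p_spam = math.exp(log_spam - top)
        p_ham = math.exp(log_ham - top)
        return p_spam / (p_spam + p_ham)

    def is_spam(self, text):
        return self.spam_probability(text) >= self.threshold


# Usage: train on a few labelled messages, then score new ones.
spam_filter = NaiveBayesFilter()
spam_filter.learn("cheap pills buy now", "spam")
spam_filter.learn("win money now free offer", "spam")
spam_filter.learn("meeting schedule for project review", "ham")
spam_filter.learn("lunch tomorrow with the project team", "ham")
```

Because `learn` only increments counters, feeding the filter one more labelled message later shifts its estimates immediately, which is the property incremental learning relies on. Storing raw counts rather than precomputed probabilities is what makes that update cheap.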
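[Degree-granting institution]: University of Electronic Science and Technology of China (电子科技大学)
[Degree level]: Master's
[Year degree granted]: 2010
[Classification number]: TP393.098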
Article ID: 1937630
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1937630.html