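Design and Implementation of a Spam Filtering System Based on the Bayesian Algorithm
Published: 2018-03-26 19:04
Topic: email protocols  Focus: Bayesian filter  Source: Jilin University, master's thesis, 2014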
[Abstract]: With the explosive growth of the Internet, email has become an important means of everyday communication. Because it is easy to send and receive, simple to use, and inexpensive, many Internet users treat email as their preferred way of keeping in touch. As email has spread, however, mailboxes routinely receive messages from unknown people or addresses. Most of these messages are advertisements, for example for free calls or discounted goods, or carry various kinds of illegal content. They may have nothing to do with the recipient's work or life, or may be actively unwelcome, yet they arrive persistently every day, filling the mailbox and disrupting daily life; some also carry viruses that can infect a computer and bring it down. Messages forced into a user's mailbox in this way are known as spam, or unsolicited bulk email (UBE); those whose main content promotes products are also called unsolicited commercial email (UCE).
Given the considerable harm that spam causes, research into curbing it has become increasingly urgent, and anti-spam technology has long been a topic of active discussion internationally. Building on earlier theory and research, this thesis systematically studies how email works and the spam filtering methods in use internationally, focusing on spam classification with the naive Bayes algorithm. The thesis first reviews the development of email and its working principles, and introduces several commonly used email protocols, such as MIME (Multipurpose Internet Mail Extensions) and SMTP (Simple Mail Transfer Protocol). It then describes rule-based spam filtering, including sender address analysis, recipient address filtering, blacklist and whitelist filtering, and subject-line filtering; these rule sets form the first line of defense against spam. Finally, it concentrates on applying the content-based naive Bayes algorithm to spam filtering and proposes some improvements that address the algorithm's shortcomings. Several approaches to Chinese word segmentation are introduced, chiefly dictionary-based segmentation, the N-gram method, and manual segmentation. A feature vector representing the text content of each message is then built, the system is trained on a corpus of messages with known labels, and naive Bayes is used to classify each newly received message, which is finally presented to the user as either spam or legitimate mail.
Finally, combining the theory with the related techniques, the thesis presents a simulation of naive Bayes spam classification in which the filter learns from sample messages. The ratio of spam to legitimate mail in the samples follows the share of spam in users' mail reported in the China Anti-Spam Status Survey Report (《中国反垃圾邮件状况调查报告》), and the experimental data reflect the effectiveness of the method at blocking spam.
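The pipeline outlined in the abstract (segment the text, build features, learn from labeled mail, classify new mail with naive Bayes) can be illustrated with a minimal sketch. This is not the thesis's implementation: the class name `NaiveBayesSpamFilter`, the helper `char_bigrams`, the Laplace smoothing constant `alpha`, the choice of character bigrams as the N-gram method, and the four toy training messages are all assumptions made for illustration. The rule-based first line of defense and the thesis's improvements to naive Bayes are not modeled here.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Tokenize text into overlapping character bigrams (N-gram with N=2).

    Whitespace is dropped first, so no dictionary or segmenter is needed.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes over bigram counts (illustrative only)."""

    LABELS = ("spam", "ham")

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}
        self.vocab = set()

    def train(self, text, label):
        """Add one labeled message to the model's counts."""
        tokens = char_bigrams(text)
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        """Return log P(label) + sum of log P(token | label).

        Requires at least one training message for each label.
        """
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_tokens = sum(self.token_counts[label].values())
        denom = total_tokens + self.alpha * len(self.vocab)
        for token in char_bigrams(text):
            count = self.token_counts[label][token]
            score += math.log((count + self.alpha) / denom)
        return score

    def classify(self, text):
        """Pick the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Toy training corpus (made up for this example).
filt = NaiveBayesSpamFilter()
filt.train("免费通话 打折商品 立即点击", "spam")
filt.train("打折促销 免费领取", "spam")
filt.train("明天下午开会 请带上报告", "ham")
filt.train("论文修改意见 已发送 请查收", "ham")

print(filt.classify("免费打折商品"))    # spam
print(filt.classify("请查收会议报告"))  # ham
```

Scores are summed in log space to avoid floating-point underflow on long messages, and smoothing keeps a bigram unseen in one class from driving that class's probability to zero. A dictionary-based segmenter could replace `char_bigrams` without changing the classifier.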
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.098
Article ID: 1669171
Article link: https://www.wllwen.com/guanlilunwen/ydhl/1669171.html