
A Spam Identification Method Based on Rough Set Theory

Published: 2018-07-22 21:04
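[Abstract]: E-mail makes communication between people more convenient, but it also causes problems. Some profit-seeking businesses send large volumes of advertising mail to e-mail users over the Internet, and some lawbreakers use e-mail to spread illegal, reactionary, fraudulent and other junk information. These practices not only congest servers but also do a degree of harm to society. The mainstream anti-spam technology today is content-based identification, but it requires a large number of matching operations and makes very heavy use of CPU and memory. Moreover, because spammers keep changing the way they disguise the content of the spam they send, the efficiency of content-based identification gradually declines over time. This thesis therefore shifts the focus of study to the mail header. The field features of a mail header are rather fuzzy: mail of different classes may share the same header features, so the data are uncertain and inconsistent. In addition, not every message contains the fields that the defined attributes refer to, so some attribute values are missing. For these reasons, a spam identification method based on the theory of incomplete information systems in rough sets is proposed. First, features are extracted from a training set of already classified mail. Because the e-mail header is semi-structured text, this thesis selects 9 header fields that reflect the characteristics of a message and defines 24 feature attributes of its own, of which 23 are condition attributes and 1 is the decision attribute. The values of the condition attributes are all discrete, and the value of the decision attribute is assigned according to the class of the sample itself. Extracting features from the training mail according to the defined attributes yields a data table; because some attribute values in this table cannot be obtained, rough set theory calls it an incomplete information system. Then, in the feature selection stage, the rough set techniques for incomplete systems are used for discretization and knowledge reduction, finally producing a decision table that can be used for classification. Each row of the decision table is a rule. A sample to be identified is string-matched against the antecedents of the rules in the table; when a matching rule is found, its consequent is the final class of the message. Finally, the recall and accuracy of mail identification are calculated and compared in experiments designed for this thesis. As a way of handling incomplete systems, extending the equivalence relation as done here is more effective than the traditional approach of filling in missing values; and compared with other header-based identification methods, namely the SVM, decision tree, Bayesian and traditional rough set algorithms, the algorithm of this thesis achieves higher recall and accuracy. The main research content is as follows: (1) Defining the attributes used for feature extraction. An e-mail header consists of a number of header fields. By analyzing the headers of a large number of spam and normal messages, 9 header fields with a high probability of occurrence were obtained: From, Sender, Reply-to, To, Delivered-To, Return-Path, Received, Message-ID and Date. By analyzing the relationships between these fields, 24 attributes were defined, comprising 23 condition attributes and 1 decision attribute. (2) Improving the non-symmetric similarity relation for incomplete systems. The information system obtained by feature extraction with the attributes defined here is incomplete, because some fields are missing from the headers. Although some attribute values are currently absent, whether two samples belong to the same class can still be judged by whether their other attribute values are the same. The equivalence relation of complete information systems is therefore extended: building on the original non-symmetric similarity relation, this thesis proposes an improved non-symmetric similarity relation, which replaces the equivalence relation of a complete system in partitioning the samples. (3) Feature selection and sample identification using the theory of incomplete information systems in rough sets. First, a discretization algorithm based on attribute importance is used to discretize the decision table. Throughout the discretization, this algorithm never changes the classification ability of the decision table; the discretized table has fewer distinct attribute values, which effectively raises the recognition rate. Then, because it is desirable for the attributes obtained by attribute reduction to have few missing values, the concept of lower-approximation completeness importance is defined and an attribute reduction algorithm based on it is proposed. During value reduction, the improved non-symmetric similarity relation is used to simplify the rules: rules that satisfy the relation can be merged, and the confidence of each rule is calculated. Finally, sample identification also matches rules according to the improved non-symmetric similarity relation; if several rules match, the rule with the higher confidence is chosen, and if no rule matches, the sample is added to the set of unrecognized samples. (4) Experiments on the proposed algorithm. First, experiments were run with training sets of different sizes, giving the attribute reduction results together with the recall, accuracy and recognition rate. The results show that the proposed algorithm is fairly stable and works well for identifying spam. Second, the algorithm was compared with the SVM, Bayesian, decision tree and traditional rough set algorithms; its recall and accuracy reached 87.10% and 89.01%, better than the other algorithms. In summary, this thesis applies the methods for handling incomplete information systems in rough set theory to header-based spam identification for the first time, using the discretization, knowledge reduction and identification methods for incomplete information systems to obtain the decision table and identify mail. Two experiments verify the effectiveness of the proposed method; judged by both recall and accuracy, the method gives satisfactory results and lays a foundation for further research on spam filtering.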
[Abstract]: E-mail brings convenience to communication between people, but it also brings trouble. Some profit-seeking businesses send large amounts of advertising mail to e-mail users on the Internet, and some lawbreakers use e-mail to disseminate illegal, reactionary and fraudulent information, which not only congests servers but also causes a degree of harm to society.
At present, the mainstream anti-spam technology is content-based mail identification, but it requires a large number of matching operations and occupies a great deal of CPU and memory. Because spam senders also keep changing the ways they disguise spam content, the efficiency of content-based spam identification gradually falls over time, so the focus of this study is shifted to the mail header. The field features of the mail header are rather fuzzy: different classes of mail may contain the same header features, which brings uncertainty and inconsistency, and not all mail contains the fields involved in the defined attributes, so some attribute values are missing. Therefore, a spam identification method based on the theory of incomplete information systems in rough sets is proposed.
First, features are extracted from the classified mail training set. Since the e-mail header is semi-structured text, this thesis selects 9 header fields that can reflect the features of a message and independently defines 24 feature attributes, including 23 condition attributes and 1 decision attribute. The values of the condition attributes are all discrete, and the value of the decision attribute is assigned according to the class of the sample itself. Extracting features from the mail in the training set according to the defined attributes gives a data table. Because some attribute values in this table cannot be obtained, it is called an incomplete information system in rough set theory. Then, in the feature selection stage, the rough set methods for incomplete systems are used for discretization and knowledge reduction, and a decision table that can be used for classification is obtained. Each row of the decision table is a rule. The sample to be identified is string-matched against the antecedents of the rules in the decision table; when a matching rule is found, the consequent of that rule is the final class of the message. Finally, the recall and accuracy of mail identification are calculated and compared in the experiments designed in this thesis. For handling incomplete systems, the method of extending the equivalence relation used here is more effective than the traditional method of filling in missing values. Compared with other header-based identification methods (the SVM, decision tree, Bayesian and traditional rough set algorithms), the algorithm in this thesis has higher recall and accuracy. The main contents of this thesis are as follows:
(1) Define the attributes used for feature extraction.
The header of an e-mail is made up of several header fields. By analyzing the headers of a large number of spam and normal messages, 9 header fields with a high probability of occurrence are obtained: From, Sender, Reply-to, To, Delivered-To, Return-Path, Received, Message-ID and Date. By analyzing the relationships between these fields, 24 attributes are defined, including 23 condition attributes and 1 decision attribute.
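As a rough illustration of item (1), the sketch below parses a raw message with Python's standard `email` package and derives a few discrete header features. The abstract does not list the 23 condition attributes, so the feature names used here (`from_eq_reply_to`, `from_eq_return_path`, `from_eq_message_id`, `received_hops`) and the `*` marker for a missing value are hypothetical stand-ins, not the thesis's own definitions.

```python
from email import message_from_string
from email.utils import parseaddr

MISSING = "*"  # hypothetical marker for an attribute whose header field is absent


def addr_domain(value):
    """Return the lower-cased domain of the address in a header value, or MISSING."""
    if value is None:
        return MISSING
    _, addr = parseaddr(value)
    if "@" not in addr:
        return MISSING
    return addr.rsplit("@", 1)[1].lower()


def same(a, b):
    """1 if both domains are known and equal, 0 if known and different, else MISSING."""
    if a == MISSING or b == MISSING:
        return MISSING
    return 1 if a == b else 0


def extract_features(raw_message):
    """Turn the header of one raw message into a row of discrete attribute values."""
    msg = message_from_string(raw_message)
    from_dom = addr_domain(msg.get("From"))
    return {
        "from_eq_reply_to": same(from_dom, addr_domain(msg.get("Reply-To"))),
        "from_eq_return_path": same(from_dom, addr_domain(msg.get("Return-Path"))),
        "from_eq_message_id": same(from_dom, addr_domain(msg.get("Message-ID"))),
        "received_hops": len(msg.get_all("Received", [])),
    }


# Made-up message used only to exercise the function.
SAMPLE = (
    "From: Alice <alice@example.com>\n"
    "Return-Path: <bounce@bulk.example.net>\n"
    "Received: from a.example.net by b.example.com\n"
    "Received: from c.example.net by a.example.net\n"
    "Message-ID: <123@example.com>\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(extract_features(SAMPLE))
```

`SAMPLE` has no Reply-To field, so `from_eq_reply_to` comes out as `*`. Gaps of this kind are what make the extracted table an incomplete information system.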
(2) Improve the non-symmetric similarity relation in incomplete systems.
The information system obtained by feature extraction with the attributes defined in this thesis is incomplete, because some fields are missing from the headers. Although some attribute values are currently absent, it is still possible to judge whether two samples belong to the same class according to whether their other attribute values are the same. The equivalence relation of the complete information system is therefore extended: on the basis of the original non-symmetric similarity relation, an improved non-symmetric similarity relation is proposed, which replaces the equivalence relation of the complete system in partitioning the samples.
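The abstract does not spell out the improved relation. For orientation, the sketch below implements only the original non-symmetric (asymmetric) similarity relation as it is usually defined for incomplete information systems: x is similar to y when every attribute that is known in x has the same value in y. The table, the attribute names and the `*` marker are made up for illustration.

```python
MISSING = "*"  # assumed marker for an unknown attribute value


def is_similar_to(x, y, attributes):
    """True if x is similar to y: each attribute known in x has the same value in y."""
    return all(x[a] == MISSING or x[a] == y[a] for a in attributes)


def objects_similar_to(target, table, attributes):
    """Names of all objects x in the table such that x is similar to `target`."""
    return {
        name
        for name, row in table.items()
        if is_similar_to(row, table[target], attributes)
    }


# Toy incomplete information system with three objects and three attributes.
ATTRS = ["a", "b", "c"]
TABLE = {
    "e1": {"a": 1, "b": MISSING, "c": 0},
    "e2": {"a": 1, "b": 1, "c": 0},
    "e3": {"a": 0, "b": 1, "c": 0},
}

if __name__ == "__main__":
    print(is_similar_to(TABLE["e1"], TABLE["e2"], ATTRS))  # e1 has less information than e2
    print(is_similar_to(TABLE["e2"], TABLE["e1"], ATTRS))  # the reverse direction
    print(objects_similar_to("e2", TABLE, ATTRS))
```

In this toy table `e1` is similar to `e2` but `e2` is not similar to `e1`, because `e2` has a known value for `b` that `e1` lacks. That one-directional behaviour is what distinguishes the relation from an equivalence relation.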
(3) Perform feature selection and sample identification using the theory of incomplete information systems in rough sets.
First, a discretization algorithm based on attribute importance is used to discretize the decision table. Throughout the discretization, this algorithm never changes the classification ability of the decision table. The discretized decision table has fewer distinct attribute values, which effectively raises the recognition rate. Then, because it is desirable for the attributes obtained by attribute reduction to contain few missing values, the concept of lower-approximation completeness importance is defined and an attribute reduction algorithm based on it is proposed. In value reduction, the improved non-symmetric similarity relation is used to simplify the rules: rules that satisfy the relation can be merged, and the confidence of each rule is calculated. Finally, when a sample is identified, rules are also matched according to the improved non-symmetric similarity relation. If several rules match, the rule with the higher confidence is selected; if no such rule exists, the sample is added to the set of unrecognized samples.
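The matching step in item (3) might be sketched as follows. The rule format, the confidence numbers and the choice to treat a missing sample value as compatible with any rule value are assumptions made for this example. The thesis matches rules with its improved similarity relation, which the abstract does not define.

```python
MISSING = "*"  # assumed marker for an unknown attribute value


def matches(rule, sample):
    """A rule matches when no known sample value contradicts its antecedent.

    Assumption: a missing sample value is treated as compatible with the rule.
    """
    return all(
        sample.get(attr, MISSING) in (MISSING, value)
        for attr, value in rule["if"].items()
    )


def classify(sample, rules):
    """Return the class of the matching rule with the highest confidence, else None."""
    candidates = [r for r in rules if matches(r, sample)]
    if not candidates:
        return None  # caller adds the sample to the unrecognized set
    return max(candidates, key=lambda r: r["confidence"])["then"]


# Illustrative decision table; attribute names and confidences are invented.
RULES = [
    {"if": {"from_eq_return_path": 0, "received_hops": "high"}, "then": "spam", "confidence": 0.92},
    {"if": {"from_eq_return_path": 0}, "then": "ham", "confidence": 0.60},
    {"if": {"from_eq_message_id": 1, "from_eq_return_path": 1}, "then": "ham", "confidence": 0.95},
]

SAMPLE_A = {"from_eq_return_path": 0, "received_hops": "high", "from_eq_message_id": MISSING}
SAMPLE_B = {"from_eq_return_path": 1, "received_hops": "low", "from_eq_message_id": 0}

if __name__ == "__main__":
    unrecognized = []
    for name, sample in [("A", SAMPLE_A), ("B", SAMPLE_B)]:
        label = classify(sample, RULES)
        if label is None:
            unrecognized.append(name)
        print(name, label)
    print("unrecognized:", unrecognized)
```

Two rules match `SAMPLE_A`, and the one with the higher confidence decides its class. No rule matches `SAMPLE_B`, so it ends up in the unrecognized list.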
(4) Test the proposed algorithm experimentally.
First, experiments are run with training sets of different sizes, and the attribute reduction results, recall, accuracy and recognition rate are obtained. The experimental results show that the proposed algorithm is fairly stable and works well for identifying spam. Second, the algorithm is compared with the SVM, Bayesian, decision tree and traditional rough set algorithms; its recall and accuracy reach 87.10% and 89.01%, which are better than those of the other algorithms.
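The abstract does not define its three metrics, so the sketch below rests on assumptions. It uses the standard definition of recall. It reads the "accuracy" figure as precision, meaning the share of messages flagged as spam that really are spam, since that is a common pairing with recall. It takes the recognition rate to be the share of samples for which some rule matched. The labels are invented and do not reproduce the thesis's data.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Compute recall, precision and recognition rate for one test run.

    A prediction of None means no rule matched (an unrecognized sample).
    Assumption: unrecognized spam counts as missed when computing recall.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    recognized = sum(1 for _, p in pairs if p is not None)
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recognition_rate": recognized / len(pairs) if pairs else 0.0,
    }


# Invented labels: three spam and three normal ("ham") messages.
TRUE = ["spam", "spam", "spam", "ham", "ham", "ham"]
PRED = ["spam", "spam", None, "spam", "ham", "ham"]

if __name__ == "__main__":
    for name, value in evaluate(TRUE, PRED).items():
        print(f"{name}: {value:.2%}")
```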
To sum up, this thesis applies the methods for handling incomplete information systems in rough set theory to header-based spam identification for the first time, and uses the discretization, knowledge reduction and identification methods for incomplete information systems in rough set theory to obtain the decision table and identify mail. The effectiveness of the proposed method is verified by two experiments. Judged by both recall and accuracy, the method achieves satisfactory results and lays a foundation for further research on spam filtering.
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.098




Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/2138568.html

