A Spam Identification Method Based on Rough Set Theory
[Abstract]: E-mail makes communication between people more convenient, but it also brings problems. Some profit-driven businesses send large volumes of mail to Internet mail users, and some senders use e-mail to disseminate illegal, reactionary, and fraudulent information. This not only congests mail servers; the illegal, reactionary, and fraudulent content also causes harm to society.
At present, mainstream anti-spam technology is based on recognizing mail content. This approach requires a large number of matching operations and occupies considerable CPU and memory. Moreover, because spam senders keep changing how they disguise the content they send, the efficiency of content-based recognition gradually declines over time. This study therefore shifts its focus to the mail header. The field features of the message header are fuzzy: different types of mail may share the same feature, so the data are uncertain and inconsistent. In addition, not every message contains all the fields involved in the defined attributes, so some attribute values are missing. For these reasons, a spam identification method based on the theory of incomplete information systems in rough sets is proposed.
First, features are extracted from a training set of mail that has already been classified. Since the e-mail message header is semi-structured text, this paper selects 9 header fields that reflect mail features and defines 24 attributes of its own, comprising 23 condition attributes and 1 decision attribute. The attribute values are all discrete, and the decision attribute value is assigned according to the category of the sample itself. Extracting the defined attributes from the training set yields a data table. Because some attribute values in this table cannot be obtained, it is called an incomplete information system in rough set theory. In the feature selection stage, the rough set theory of incomplete systems is then used to discretize the table and perform knowledge reduction, producing a decision table in which each row is a rule. A sample to be identified is matched against the rules in the decision table, and the consequent of the matching rule gives the category of the mail. Finally, the experiments designed in this paper compute and compare the recall and accuracy of mail recognition. Among the methods for handling incomplete systems, the extended equivalence relation used in this paper proves more effective than the traditional method. Compared with other header-based recognition methods (the SVM algorithm, the decision tree algorithm, the Bayes algorithm, and the traditional rough set algorithm), the proposed method achieves higher recall and accuracy. The main contents of this paper are as follows:
(1) Define the attributes used for feature extraction.
An e-mail header is made up of several header fields. By analyzing a large number of spam and normal messages, the 9 header fields that occur with the highest probability are identified: From, Sender, Reply-to, To, Delivered-To, Return-Path, Received, Message-ID, and Date. By analyzing these fields, 24 attributes are defined, comprising 23 condition attributes and 1 decision attribute.
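As a rough illustration of how the nine header fields might be read from a raw message, the following minimal sketch uses Python's standard `email` package. The function name `extract_header_fields` and the sample message are hypothetical, and the 24 attributes themselves are not reproduced because the abstract does not list them. An absent field is returned as `None`, which is the kind of missing value that makes the resulting table incomplete.

```python
from email import message_from_string

# The nine header fields named in the text.
HEADER_FIELDS = ["From", "Sender", "Reply-To", "To", "Delivered-To",
                 "Return-Path", "Received", "Message-ID", "Date"]


def extract_header_fields(raw_message):
    """Map each of the nine fields to its list of values, or None if absent."""
    msg = message_from_string(raw_message)
    # get_all() matches names case-insensitively and returns None when the
    # field is missing from the header.
    return {name: msg.get_all(name) for name in HEADER_FIELDS}


# Hypothetical sample message, for illustration only.
sample_message = (
    "Return-Path: <a@example.com>\n"
    "Received: from mx1.example.com\n"
    "Received: from mx2.example.com\n"
    "From: Alice <a@example.com>\n"
    "To: bob@example.org\n"
    "Date: Mon, 2 Jan 2012 10:00:00 +0800\n"
    "\n"
    "Hello\n"
)

fields = extract_header_fields(sample_message)
# Fields whose attribute values would be missing for this sample.
missing = [name for name, values in fields.items() if values is None]
print(missing)
```

Condition attributes would then be computed from `fields`; how each attribute is derived is specific to the thesis and is not assumed here.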
(2) Improve the asymmetric similarity relation in incomplete systems.
Because some fields are missing from the headers, the information system obtained by extracting the attributes defined in this paper is an incomplete information system. Although some attribute values are currently missing, it is still possible to judge whether two samples belong to the same category from the values they share on the other attributes. This paper therefore extends the equivalence relation of the complete information system: building on the original asymmetric similarity relation, it proposes an improved asymmetric similarity relation, which is applied to the samples in place of the equivalence relation of the complete system.
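The improved relation is not specified in this abstract, so the sketch below shows only the original (non-symmetric) similarity relation commonly used for incomplete information systems: a sample x is similar to a sample y if y agrees with x on every attribute whose value is known for x. The names `MISSING` and `is_similar` and the toy samples are assumptions made for illustration.

```python
MISSING = None  # marker for an attribute value that could not be obtained


def is_similar(x, y):
    """True if x is similar to y: y matches every known attribute value of x."""
    return all(xv == yv for xv, yv in zip(x, y) if xv is not MISSING)


# Toy samples over three condition attributes (hypothetical values).
x = (1, MISSING, 0)
y = (1, 2, 0)

print(is_similar(x, y))  # x has no known value that y contradicts
print(is_similar(y, x))  # y knows attribute 2, but x does not
```

The two calls give different answers, which is what makes the relation asymmetric, in contrast to an equivalence relation on a complete system.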
(3) Feature selection and sample recognition based on the theory of incomplete information systems in rough sets.
A discretization algorithm based on attribute importance is used to discretize the decision table; throughout, this algorithm leaves the classification ability of the decision table unchanged. After discretization the decision table has fewer attribute values, which effectively increases the recognition rate. Because the resulting table still contains a small number of missing attribute values, the concept of lower-approximation complete importance is defined, and an attribute reduction algorithm based on it is proposed. In value reduction, the rules are simplified using the improved asymmetric similarity relation: rules that satisfy the improved asymmetric similarity relation are merged, and the credibility of each rule is calculated. Finally, when a sample is identified, rules are matched according to the improved asymmetric similarity relation. If several rules match, the rule with the greatest credibility is selected. If no rule matches, the sample is added to the set of unidentified samples.
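The recognition step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's algorithm: the rule layout (`"if"`, `"then"`, `"credibility"`), the function names, and the toy rules are hypothetical, and matching uses the original asymmetric similarity (a rule matches when the sample agrees with every condition the rule specifies) because the improved relation is not given in the abstract.

```python
def rule_matches(conditions, sample):
    """A rule matches if the sample agrees with every condition it specifies."""
    return all(c == s for c, s in zip(conditions, sample) if c is not None)


def classify(sample, rules):
    """Return the class of the most credible matching rule, or None."""
    matches = [r for r in rules if rule_matches(r["if"], sample)]
    if not matches:
        return None
    return max(matches, key=lambda r: r["credibility"])["then"]


# Hypothetical rules; None marks a condition that the rule does not specify.
rules = [
    {"if": (1, None, 0), "then": "spam", "credibility": 0.9},
    {"if": (1, 2, None), "then": "normal", "credibility": 0.6},
]

unidentified = []  # samples that no rule matches
results = []
for sample in [(1, 2, 0), (1, 2, 1), (0, 2, 1)]:
    label = classify(sample, rules)
    if label is None:
        unidentified.append(sample)
    else:
        results.append((sample, label))
```

The first sample matches both rules and takes the label of the more credible one; the last matches neither and ends up in `unidentified`.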
(4) Evaluate the proposed algorithm experimentally.
First, training sets of different sizes are used to obtain the attribute reduction results, recall, accuracy, and recognition rate. The experimental results show that the proposed algorithm is stable and identifies spam effectively. Second, the algorithm is compared with the SVM algorithm, the Bayes algorithm, the decision tree algorithm, and the traditional rough set algorithm. Its recall and accuracy reach 87.10% and 89.01%, respectively, which are better than those of the other algorithms.
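For reference, the usual definitions of these measures can be written out from confusion-matrix counts. The abstract does not say whether its "accuracy" means precision on the spam class or overall accuracy, so this sketch shows both; the function names and the counts are made up for illustration and are not the thesis's data.

```python
def recall(tp, fn):
    """Share of actual spam that was identified as spam."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of messages identified as spam that really are spam."""
    return tp / (tp + fp)


def overall_accuracy(tp, tn, fp, fn):
    """Share of all messages that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up counts: tp = spam caught, fn = spam missed,
# fp = normal mail flagged as spam, tn = normal mail passed.
tp, fn, fp, tn = 80, 20, 10, 90

print(f"recall    = {recall(tp, fn):.4f}")
print(f"precision = {precision(tp, fp):.4f}")
print(f"accuracy  = {overall_accuracy(tp, tn, fp, fn):.4f}")
```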
In summary, this paper applies, for the first time, the methods of rough set theory for processing incomplete information systems to the domain of spam recognition, using discretization, knowledge reduction, and recognition methods for incomplete information systems to obtain the decision table and identify mail. Two experiments verify the effectiveness of the proposed method. The experimental results are satisfactory in terms of both recall and accuracy and lay a foundation for further research on spam filtering.
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.098