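A "Word Frequency-Sieve" Hybrid Feature Selection Method for Spam Identification
Published: 2018-06-12 04:31
Topic: spam identification + hybrid feature selection method; Source: Journal of South China University of Technology (Natural Science Edition), 2017, Issue 03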
[Abstract]: To address the growing volume of spam, this paper uses naive Bayes and support vector machine (SVM) classifiers to identify and classify spam, and applies a "word frequency-sieve" hybrid feature selection method to the classifier models to improve their recognition performance. The naive Bayes classification model is also improved by taking a more complete set of classification probability cases into account, which further raises the recognition performance of the naive Bayes classifier. Experiments measure the accuracy, recall, and F1 score of the resulting spam identification system. The results show that the "word frequency-sieve" hybrid feature selection method effectively improves the recognition performance of the spam classifiers, and that a classification output adjustment module based on a cost-sensitive method greatly reduces the probability that the classifier misjudges legitimate mail as spam. The spam identification system designed in the paper is therefore practical and suitable for use in everyday work and life.
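The pipeline outlined in the abstract (feature selection, a naive Bayes classifier, a cost-sensitive output adjustment, and evaluation metrics) can be illustrated with a minimal sketch. The abstract does not describe how the "word frequency-sieve" method or the improved naive Bayes model actually work, so everything below is an assumption made for illustration: a plain document-frequency cutoff stands in for the feature selection step, a standard multinomial naive Bayes with Laplace smoothing stands in for the classifier, and the names `select_by_doc_frequency`, `train_nb`, `spam_probability`, `classify`, `fp_cost`, and `fn_cost` are hypothetical.

```python
import math
from collections import Counter


def select_by_doc_frequency(docs, min_df=2):
    """Keep words that occur in at least min_df documents.

    Assumption: a simple stand-in filter, not the paper's
    "word frequency-sieve" method, whose details are not given here.
    """
    df = Counter()
    for words in docs:
        df.update(set(words))
    return {w for w, c in df.items() if c >= min_df}


def train_nb(docs, labels, vocab):
    """Standard multinomial naive Bayes (1 = spam, 0 = legitimate mail)."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for words, y in zip(docs, labels):
        counts[y].update(w for w in words if w in vocab)
    model = {"vocab": vocab, "log_prior": {}, "log_like": {}}
    for y in (0, 1):
        # Laplace (add-one) smoothing over the selected vocabulary.
        total = sum(counts[y].values()) + len(vocab)
        model["log_prior"][y] = math.log(n_docs[y] / len(labels))
        model["log_like"][y] = {
            w: math.log((counts[y][w] + 1) / total) for w in vocab
        }
    return model


def spam_probability(model, words):
    """Posterior probability that a message is spam."""
    scores = {}
    for y in (0, 1):
        scores[y] = model["log_prior"][y] + sum(
            model["log_like"][y][w] for w in words if w in model["vocab"]
        )
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {y: math.exp(s - top) for y, s in scores.items()}
    return exp[1] / (exp[0] + exp[1])


def classify(model, words, fp_cost=9.0, fn_cost=1.0):
    """Cost-sensitive decision: flag as spam only when the expected cost of
    flagging, (1 - p) * fp_cost, is below the cost of passing, p * fn_cost.
    The cost values are illustrative assumptions.
    """
    threshold = fp_cost / (fp_cost + fn_cost)
    return 1 if spam_probability(model, words) > threshold else 0


def evaluate(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with spam as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data, invented purely for illustration.
docs = [
    ["win", "free", "prize", "now"],
    ["free", "money", "win"],
    ["meeting", "agenda", "monday"],
    ["project", "meeting", "notes"],
]
labels = [1, 1, 0, 0]

vocab = select_by_doc_frequency(docs, min_df=2)
model = train_nb(docs, labels, vocab)
print(spam_probability(model, ["free", "win"]))
print(classify(model, ["free", "win"], fp_cost=1.0, fn_cost=1.0))
print(classify(model, ["free", "win"], fp_cost=9.0, fn_cost=1.0))
```

Raising `fp_cost` relative to `fn_cost` moves the decision threshold above 0.5, so a message is flagged only when the classifier is more confident. This trades some recall for fewer legitimate messages marked as spam, which is the kind of effect the abstract attributes to its output adjustment module.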
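[Author affiliation]: School of Software Engineering, South China University of Technology; Guangzhou Key Laboratory of Robot Software and Complex Information Processing
[Funding]: Natural Science Foundation of Guangdong Province (2016A030310412); Guangdong Provincial Key Platform and Research Project for Universities, Young Innovative Talents category (2015KQNCX003); Guangzhou Science and Technology Program Key Laboratory Project (15180007); Guangzhou Science and Technology Program (201707010223)
[CLC number]: TP18; TP393.098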
Article ID: 2008363
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2008363.html