An Improved TF-IDF Algorithm Implementation and Its Application in Spam Identification
Published: 2019-05-20 15:40
[Abstract]: Internet technology has brought the 21st century into the information age, making the creation and dissemination of information easier than ever before. Yet the Internet is a double-edged sword: the same ease of creation and dissemination has led to a flood of junk information. Identifying and removing junk information from this vast volume of data is becoming one of the hot research problems in computer science. E-mail, one of the most important Internet services, is likewise constantly disrupted by junk information. A practical method for identifying and separating spam is therefore needed to safeguard normal communication and work.
This thesis proposes a spam identification strategy based on an improved TF-IDF (term frequency-inverse document frequency) algorithm. The strategy builds on the TF-IDF algorithm, which is widely used in search engines. To address that algorithm's incomplete selection of spam feature words and the insufficient discriminative power of those words, the thesis takes into account how feature terms are distributed across classes, as well as content and position weights. The main improvements are as follows:
1. An information entropy coefficient is introduced into the TF-IDF weights to correct the features.
2. Because the traditional TF-IDF algorithm gives insufficient consideration to content and position weights, position and content weights are introduced as corrections in the IDF calculation.
3. The concept of an independence coefficient is introduced as a parameter measuring the association between a feature term and its class.
4. Based on the binary-classification nature of spam identification, the corresponding parameters of the IDF calculation are simplified.
5. Comparative experiments on corpus data show that the improved TF-IDF algorithm improves considerably on the traditional TF-IDF algorithm in recall, error rate, and F1 score.
Furthermore, drawing on support vector machine theory from machine learning, the thesis applies the improved TF-IDF algorithm to build a spam identification and classification model. The model comprises three main modules: a training module, a testing module, and a statistics module. Through text segmentation of e-mails, extraction and filtering of feature terms, and conversion of the data format for similarity comparison, these modules respectively train the system, classify unknown e-mails, and compile statistics on e-mail data. The system was tested on the test e-mail set from the corpus. The experiments show that the implemented Chinese spam identification system can, for the most part, effectively identify and isolate the majority of spam. Compared with the traditional TF-IDF algorithm and with a spam identification system formerly used by Tencent, it shows a significant improvement, and it largely meets the need to screen out users' spam and safeguard their normal communication and work.
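The abstract does not state the thesis's formulas, so the following is only a minimal sketch under stated assumptions. It shows the classic TF-IDF weight, one possible class-distribution entropy coefficient for a two-class (spam/ham) corpus, and the usual definitions of recall, F1, and error rate. The function names, the toy corpus, the form of the coefficient (1 minus the entropy of the class distribution of documents containing the term), and the reading of "error rate" as the fraction of misclassified messages are all illustrative assumptions, not the author's method. Documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: tf = count / document length, idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (c / len(doc)) * math.log(n_docs / df[term])
            for term, c in counts.items()
        })
    return weights


def entropy_coefficient(term, docs, labels):
    """Assumed coefficient: 1 - H, where H is the entropy (in bits) of the
    class distribution of the documents containing `term`.
    With two classes this lies in [0, 1]: 1.0 if the term occurs in only
    one class, 0.0 if it is split evenly or never occurs."""
    per_class = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    total = sum(per_class.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in per_class.values())
    return 1.0 - entropy


def recall_f1_error(y_true, y_pred, positive="spam"):
    """Recall and F1 for the positive class, plus the overall error rate
    (assumed here to mean the fraction of misclassified messages)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    error_rate = sum(t != p for t, p in pairs) / len(pairs)
    return recall, f1, error_rate


# Toy, hand-made corpus (hypothetical data, already tokenized).
docs = [
    ["free", "prize", "click"],
    ["free", "offer", "click"],
    ["meeting", "agenda", "free"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

# Scale each classic weight by the assumed entropy coefficient.
adjusted = [
    {term: w * entropy_coefficient(term, docs, labels)
     for term, w in doc_weights.items()}
    for doc_weights in tf_idf(docs)
]
print(adjusted[0])
```

In this toy corpus "free" appears in both classes, so its coefficient is close to zero and its adjusted weight shrinks, while "click" appears only in spam and keeps its full weight. The position and content weights, the independence coefficient, and the SVM classifier mentioned in the abstract are not sketched here because the abstract gives no detail on them.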
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.098
Article ID: 2481749
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2481749.html