An Improved TF-IDF Algorithm Implementation and Its Application to Spam Email Identification

Published: 2019-05-20 15:40

[Abstract]: Internet technology has carried the 21st century into the information age, making the generation and dissemination of information easier than ever before. It is also a double-edged sword: the same ease of generation and dissemination has led to a flood of junk information. Identifying and removing junk information from this vast volume of data has become one of the active research problems in computing. E-mail, one of the most important Internet services, is likewise under constant interference from spam. A practical method for identifying and separating spam is therefore needed to protect normal communication and work.

This thesis proposes a spam identification strategy based on an improved TF-IDF (term frequency-inverse document frequency) algorithm. The strategy builds on TF-IDF, which is widely used in search engines. To address two weaknesses of the original algorithm when applied to spam, namely incomplete selection of feature words and insufficient discrimination between them, the thesis takes into account the distribution of feature terms across categories as well as content and position weights. The main improvements are as follows:

1. An information entropy coefficient is introduced into the TF-IDF weight to correct the features.
2. Because the traditional TF-IDF algorithm gives too little consideration to content and position, content and position weights are introduced as corrections in the IDF calculation.
3. An independence coefficient is introduced as a parameter that measures the association between a feature term and the category it is assigned to.
4. Because spam identification is a binary classification problem, the corresponding parameters of the IDF calculation are simplified.

Comparative experiments on corpus data show that the improved TF-IDF algorithm performs considerably better than the traditional TF-IDF algorithm on recall, error rate, and F1 score.

Furthermore, the thesis applies support vector machines from machine learning, together with the improved TF-IDF algorithm, to build a model that identifies and classifies spam. The model consists of three main modules: a training module, a testing module, and a statistics module. By segmenting the mail text into words, extracting and filtering feature terms, and converting the data into a form suitable for similarity comparison, these modules respectively train the system, classify unknown mail, and compile statistics on the mail data. The system was tested with the test mail set from the corpus. The experiments show that the resulting Chinese spam identification system can identify and isolate most spam reasonably effectively. Compared with a system based on the traditional TF-IDF algorithm and with a spam identification system formerly used by Tencent, it shows a significant improvement, and it largely meets the need to filter out users' spam and protect their normal communication and work.
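The abstract names the entropy-based correction but does not give its formula, so the following Python sketch is only an illustration of the general idea and not the thesis's method. It assumes a hypothetical coefficient equal to one minus the entropy (in bits) of a term's document distribution over the two classes, and multiplies the classic TF-IDF weight by it. The function names, the toy corpus, and the exact form of the coefficient are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: tf = count / document length, idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def entropy_coefficient(term, docs, labels):
    """Hypothetical coefficient in [0, 1] for a two-class problem.

    Returns 1 minus the entropy (in bits) of the term's document
    distribution over the classes. A term found in only one class
    scores 1; a term spread evenly over both classes scores 0.
    This form is an assumption, not the formula from the thesis.
    """
    per_class = Counter(label for doc, label in zip(docs, labels) if term in doc)
    total = sum(per_class.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in per_class.values())
    return 1.0 - entropy  # maximum entropy for two classes is 1 bit


def weighted_tf_idf(docs, labels):
    """Scale each classic TF-IDF weight by the entropy coefficient."""
    return [
        {t: w * entropy_coefficient(t, docs, labels) for t, w in doc_weights.items()}
        for doc_weights in tf_idf(docs)
    ]


# Toy corpus of pre-segmented messages (hypothetical data).
docs = [
    ["free", "prize", "click"],
    ["free", "meeting", "notes"],
    ["prize", "click", "now"],
    ["meeting", "agenda", "notes"],
]
labels = ["spam", "ham", "spam", "ham"]

weighted = weighted_tf_idf(docs, labels)
```

In this toy corpus, "free" occurs once in each class, so its coefficient is 0 and its weight vanishes, while "prize" occurs only in spam and keeps its full TF-IDF weight. The sketch ignores class imbalance and the position, content, and independence corrections listed in the abstract; the resulting vectors could then be passed to any classifier, such as a support vector machine.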
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.098
