A Category-Related Feature Weighting Algorithm Based on Term Frequency
Published: 2019-05-13 10:17
[Abstract]: Current research on feature weighting for text classification has two shortcomings. First, in feature weighting algorithms based on document frequency, the document frequency often ignores the term frequency of a feature. Second, the relationship between features and categories is not expressed accurately or fully. To address these shortcomings, a new term-frequency-based, category-related feature weighting algorithm (CDF-AICF) is proposed. When measuring feature weight, the algorithm considers the document frequency of a feature at each term frequency. To express the relationship between features and categories accurately, two new concepts are introduced: category-related document frequency (CDF) and average inverse class frequency (AICF), which represent a feature's representativeness for a category and its power to discriminate between categories, respectively. Finally, classification experiments on three datasets compare CDF-AICF with five other feature weighting methods, and the results show that CDF-AICF achieves better classification performance than all five.
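The abstract does not state the CDF or AICF formulas, so the following is not the paper's algorithm. It is only a minimal sketch of the underlying observation: plain document frequency discards how often a term occurs within each document, whereas counting documents separately at each term frequency keeps that information. The function name `df_by_term_frequency` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def df_by_term_frequency(docs, labels):
    """For each (term, class) pair, count how many documents of that class
    contain the term exactly tf times, for every observed tf.

    Hypothetical helper for illustration; not taken from the paper.
    """
    table = defaultdict(Counter)  # (term, label) -> Counter {tf: number of documents}
    for tokens, label in zip(docs, labels):
        for term, tf in Counter(tokens).items():
            table[(term, label)][tf] += 1
    return table


# Toy corpus (assumed data): tokenized documents with class labels.
docs = [
    ["goal", "goal", "match"],
    ["goal", "team"],
    ["stock", "goal"],
]
labels = ["sport", "sport", "finance"]

table = df_by_term_frequency(docs, labels)

# Plain document frequency of "goal" in class "sport" is the sum over all tf values.
plain_df = sum(table[("goal", "sport")].values())
```

Here plain document frequency treats both "sport" documents identically, while the per-term-frequency counts distinguish the document that uses "goal" twice from the one that uses it once.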
[Author affiliation]: Network Department, Electronic Engineering Institute;
[Classification number]: TP391.1
Article ID: 2475801
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2475801.html