Research on Feature Weighting Algorithms and Text Representation Strategies in Text Classification

Published: 2018-07-11 18:10

Topic: machine learning + text classification; Reference: Northeast Normal University, 2016 doctoral dissertation

[Abstract]: Data has penetrated every industry and become an important factor of production. With the arrival of the big data era, demand for text information processing technology grows daily, and manual management can no longer meet society's needs. Automatic text classification has therefore become increasingly important and is now a research hotspot. Building on an analysis and summary of the text classification framework, text representation models, text preprocessing, feature selection, feature extraction, feature weighting, text classifiers, and classification performance evaluation, this dissertation studies text feature weighting and text representation strategies in depth. It proposes two feature weighting algorithms for balanced data sets and one for imbalanced data sets, for a total of three supervised feature weighting algorithms. In addition, it proposes an optimal text representation strategy for supervised feature weighting algorithms. The results obtained so far are as follows.

1. Feature weighting algorithm based on category information. For most text classifiers that use the vector space model, feature weighting has long been a bottleneck, and its quality directly affects classification performance. Based on an analysis of traditional feature weighting algorithms, a new feature weighting algorithm is proposed. By converting word-based features into category-based features, it reduces the feature dimension of the data set from the original thousands or tens of thousands to the number of categories in the data set, so the feature representation matrix is no longer sparse. Compared with other feature weighting methods, the proposed method improves classification accuracy and also increases classification speed, reducing classification time.

2. Feature weighting algorithm based on class space density. Building on an analysis of the inverse class frequency method among traditional feature weighting algorithms, class space density is introduced, and from it the inverse class space density frequency is incorporated into feature weighting. When measuring a feature's discriminating power, features that have the same class frequency but different document frequencies within those classes can be assigned different weights. The method reflects a feature's importance to classification more objectively and improves the distribution of samples in space, making samples of the same class more compact and samples of different classes more dispersed. Replacing the inverse class frequency parameter in the tf*icf and icf-based methods with the proposed inverse class space density frequency parameter yields two new feature weighting algorithms: tf*ICSDF and ICSDF-based. Experimental results show that these algorithms achieve good text classification performance.

3. Feature weighting algorithm for imbalanced data sets. Common feature weighting algorithms often fail to achieve the expected results on imbalanced data sets, mainly because of the particular data distribution of such sets. Based on an analysis of the distribution characteristics of imbalanced data sets, a feature weighting algorithm for imbalanced data sets is proposed. The algorithm combines the probability that a feature appears in positive-category documents with the probability that it appears in negative-category documents to measure the importance of different features for text classification in an imbalanced data set, and assigns feature weights according to that importance. In the experiments, the proposed tf*WID feature weighting algorithm is compared with four common feature weighting algorithms (tf*idf, tf*ig, tf*chi2, and tf*or) on two imbalanced data sets, WebKB and Yahoo! Answers (100-1000), using the Rocchio classifier and the support vector machine classifier, in terms of micro-averaged F1 and macro-averaged F1. The results show that the proposed algorithm effectively improves classification performance on imbalanced data sets.

4. Optimal text representation strategy for supervised feature weighting methods. Based on an analysis of traditional text representation strategies (the global strategy and the local strategy) and on the vector space model, an optimal text representation strategy for supervised feature weighting methods is proposed. The method adopts the idea of searching for the optimal model on the training set: from the feature weighting vectors of all categories, it obtains the one that is optimal for the training set, and applying that vector to the test set yields the optimal text representation of the test set. The method is validated on two data sets, the balanced 20Newsgroups and the imbalanced Reuters-21578. In the experiments, two common supervised feature weighting methods (tf*or and tf*rf) are used to weight the feature matrices of both data sets. The proposed method finds the optimal feature weighting vector on the training set, the vector is then applied to the test set, and a support vector machine classifier performs the classification. Experimental results show that the proposed optimal text representation strategy for supervised feature weighting methods effectively improves classification performance.
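The motivation for the class-space-density weighting can be illustrated with a small sketch. The abstract does not give the ICSDF formula, so the code below is not the dissertation's algorithm. It uses the common form of inverse class frequency, log(|C| / cf(t)), and a hypothetical density-aware factor, log(1 + |C| / density), chosen only to show how two terms with the same class frequency can receive different weights. The toy corpus and all function names are assumptions made for illustration.

```python
import math

# Toy labelled corpus (an assumption for illustration): class -> list of documents,
# each document given as the set of terms it contains.
CORPUS = {
    "sports": [{"match", "goal"}, {"match", "team"}, {"match", "coach"}, {"match"}],
    "finance": [{"stock", "market"}, {"bank", "goal"}, {"stock"}, {"loan"}],
    "health": [{"diet"}, {"sleep"}, {"diet", "sleep"}, {"clinic"}],
}


def class_density(term, docs):
    """Fraction of the documents in one class that contain the term."""
    return sum(term in doc for doc in docs) / len(docs)


def class_frequency(term, corpus):
    """Number of classes in which the term occurs at least once."""
    return sum(class_density(term, docs) > 0 for docs in corpus.values())


def icf(term, corpus):
    """Inverse class frequency in its common form log(|C| / cf(t)).

    The term must occur somewhere in the corpus, otherwise cf(t) is zero.
    """
    return math.log(len(corpus) / class_frequency(term, corpus))


def class_space_density(term, corpus):
    """Sum of the term's per-class densities over all classes."""
    return sum(class_density(term, docs) for docs in corpus.values())


def inverse_density_factor(term, corpus):
    """Hypothetical density-aware factor: log(1 + |C| / class space density).

    This is a stand-in, not the ICSDF definition from the dissertation.
    """
    return math.log(1 + len(corpus) / class_space_density(term, corpus))


def weight_document(term_counts, factor, corpus):
    """Multiply raw term frequencies by a per-term factor (tf * factor)."""
    return {t: tf * factor(t, corpus) for t, tf in term_counts.items()}


if __name__ == "__main__":
    for term in ("match", "loan"):
        print(term,
              class_frequency(term, CORPUS),
              round(icf(term, CORPUS), 3),
              round(inverse_density_factor(term, CORPUS), 3))
```

In this toy corpus, "match" and "loan" each occur in exactly one class, so icf assigns them the same value. "match" appears in every document of its class and "loan" in only one of four, so the density-aware factor separates them. Whether a sparser term should receive the larger or the smaller weight depends on the actual ICSDF definition, which the abstract leaves unspecified.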
[Degree-granting institution]: Northeast Normal University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP391.1

[Similar literature]