
Research on Text Classification Models and Algorithms for Small Samples

Published: 2018-06-20 21:29

  Thesis topic: text classification + machine learning; Source: doctoral dissertation, University of Electronic Science and Technology of China, 2017


【Abstract】: Text data is high-dimensional and sparse, and its volume is growing explosively. This poses two difficulties for traditional machine learning algorithms. First, high-accuracy classifiers such as support vector machines (SVM) and artificial neural networks mostly cannot be applied to massive data mining or online classification because of bottlenecks in training efficiency and computational resource consumption. Second, classifiers with linear time complexity, such as the centroid classifier, naive Bayes, and logistic regression, tend to have low accuracy. This thesis therefore studies two topics: methods for extracting small-sample data sets and methods for classifying them. A "small sample" here means one that is small in both dimensionality and quantity. The extraction methods studied (feature selection and instance selection) condense massive data sets, addressing the first difficulty; the linear classification models studied aim to learn high-accuracy classifiers from small-sample data, addressing the second. The main contributions and innovations are as follows.

(1) A new statistical index, the LW-index, for evaluating feature subsets and hence dimensionality-reduction algorithms. It is a "classical statistics" approach that scores a feature subset from empirical estimates computed directly on it. Traditional feature-subset evaluation splits the data into a training set for fitting a classifier and a test set for measuring its predictive performance, then averages several such estimates, i.e., cross-validation (CV). CV, however, is very time-consuming and computationally expensive. Experiments show that the proposed method agrees with five-fold cross-validation in evaluating dimensionality-reduction algorithms while taking only about 1/10 and 1/2 of the time of CV with an SVM (Support Vector Machine) and a CBC (Centroid-Based Classifier) classifier, respectively.

(2) A feature selection algorithm, SFS-LW, based on the sequential forward search (SFS) strategy. Wrapper feature selection in text classification finds features of high value for classification, but its evaluation step carries an enormous time cost. Combining the SFS strategy commonly used in wrapper methods with the LW-index yields a new filter algorithm, SFS-LW. Experiments show that SFS-LW approaches the wrapper's classification accuracy while being several times faster, with a time cost close to that of existing filter methods.

(3) A linear adaptive support-vector selection algorithm, Shell Extraction (SE). To make high-accuracy classifiers applicable to massive data sets, SE identifies likely support vectors by exploiting the uneven density of samples in the vector space, thereby shrinking large data sets and filtering noise. Traditional instance selection relies mostly on nearest-neighbor or clustering methods, whose high time complexity likewise bars them from massive data sets. Experiments show that SE exceeds existing algorithms in accuracy and runs far faster than existing instance-selection algorithms.

(4) A new classification model, the Gravitation Model (GM). Centroid-based classifiers, being simple and efficient, are among the most widely used text classifiers, but their accuracy depends heavily on the distribution of the training samples: when that distribution is skewed, the centroid model cannot fit the training data well and classifies poorly. GM remedies this underfitting. During training, it learns for each class a mass factor characterizing that class's sample distribution; at test time, it assigns an unknown sample to the class that exerts the greatest gravitation on it.

(5) A GM learning algorithm, AAC-SLA, combining the arithmetical average centroid (AAC) with a stochastic mass-factor learning algorithm (Stochastic Learning Mass, SLA). Experiments show that AAC-SLA consistently beats the original centroid classifier in accuracy, matches the performance of the best current centroid classifiers, and is more stable than they are.

(6) A GM learning algorithm, MEB-SLA, combining the minimum enclosing ball (MEB) with SLA. The MEB effectively avoids the influence that randomly distributed samples within a class exert on the position of the arithmetic-average centroid. Experiments show that MEB-SLA outperforms AAC-SLA, and that both outperform the SVM on small-sample data sets.

Finally, the proposed SFS-LW and SE algorithms were used to generate small-sample data sets with 1/10 of the original feature dimensions and 1/10 of the original samples, on which AAC-SLA, MEB-SLA, and SVM were trained. Experiments show that the accuracy of AAC-SLA and MEB-SLA drops only slightly on most data sets and consistently exceeds that of SVM. The conclusions are: (1) for learning tasks on small or medium data sets, MEB-SLA can be used directly; (2) for large data sets, SE combined with AAC-SLA is recommended.
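The abstract's LW-index formula is not given, but the cross-validation baseline it is designed to replace is standard. The sketch below scores a feature subset by five-fold CV with a centroid-based classifier (the CBC mentioned above); the data, function names, and random seed are illustrative assumptions, not the thesis's own code.

```python
import numpy as np

def fit_cbc(X, y):
    # Centroid-based classifier (CBC): one arithmetic-mean centroid per class.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_cbc(model, X):
    # Assign each row to the class of the nearest centroid.
    cls = list(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in cls])
    return np.array(cls)[np.argmin(d, axis=0)]

def cv_score(X, y, cols, fit, predict, folds=5, seed=0):
    # k-fold cross-validation accuracy of the feature subset `cols` --
    # the expensive evaluation that the LW-index is meant to replace.
    Xs = X[:, cols]
    idx = np.random.default_rng(seed).permutation(len(y))
    parts = np.array_split(idx, folds)
    accs = []
    for i in range(folds):
        test = parts[i]
        train = np.concatenate([parts[j] for j in range(folds) if j != i])
        model = fit(Xs[train], y[train])
        accs.append(np.mean(predict(model, Xs[test]) == y[test]))
    return float(np.mean(accs))

# Two well-separated classes; feature 0 is informative, feature 1 is not.
X = np.column_stack([np.array([0.0] * 10 + [10.0] * 10), np.zeros(20)])
y = np.array([0] * 10 + [1] * 10)
score = cv_score(X, y, cols=[0], fit=fit_cbc, predict=predict_cbc)
```

Each candidate subset pays `folds` full train/predict cycles here, which is exactly the overhead the abstract reports the LW-index avoiding.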
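The sequential forward search behind SFS-LW is a generic greedy loop: starting from the empty set, repeatedly add the feature that most improves the subset score. Since the LW-index itself is not specified in this abstract, the score below is a placeholder argument; a toy scoring function stands in for it.

```python
def sfs_select(n_features, score, k):
    # Greedy sequential forward search (SFS): at each step, add the
    # feature whose inclusion maximises the subset score (in SFS-LW,
    # the score would be the LW-index).
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score that simply prefers low feature indices, standing in
# for the LW-index (whose formula is not given in this abstract).
picked = sfs_select(5, score=lambda subset: -sum(subset), k=3)
```

SFS evaluates O(n·k) candidate subsets, so replacing a CV-based score with a cheap statistical index is what makes the filter variant fast.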
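The abstract does not give Shell Extraction's selection rule, only that it exploits uneven sample density to find likely support vectors. One hypothetical reading, sketched below under that assumption, keeps the outer "shell" of each class (the samples farthest from the class centroid), where boundary points tend to lie; this is an illustration, not the thesis's algorithm.

```python
import numpy as np

def shell_extract(X, y, keep_ratio=0.1):
    # Per class, keep the samples farthest from the class centroid --
    # the outer "shell" where candidate support vectors tend to lie.
    # (A hypothetical reading of SE; the thesis's exact rule is not
    # given in this abstract.)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        mu = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - mu, axis=1)
        k = max(1, int(round(len(idx) * keep_ratio)))
        keep.extend(idx[np.argsort(d)[-k:]])
    return np.sort(np.array(keep))

X = np.array([[0.0, 0], [0.1, 0], [5, 0], [10, 0], [10.1, 0], [15, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
subset = shell_extract(X, y, keep_ratio=0.4)
```

A rule of this shape is linear in the number of samples, in contrast to the nearest-neighbor and clustering instance-selection methods the abstract criticizes.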
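The Gravitation Model's decision rule can be sketched directly from the description above: each class gets a centroid and a learned mass factor, and a test sample goes to the class exerting the largest gravitation. The inverse-square form of the gravitation and the constant initial masses below are assumptions; the thesis learns the masses with its SLA algorithm, which this sketch omits.

```python
import numpy as np

def train_aac(X, y):
    # Arithmetical average centroid (AAC) per class; mass factors start
    # at 1.0 as a placeholder for the thesis's stochastic SLA learning.
    classes = np.unique(y)
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    masses = {c: 1.0 for c in classes}
    return centroids, masses

def predict_gm(x, centroids, masses, eps=1e-12):
    # Assign x to the class with the largest "gravitation"
    # g_c(x) = m_c / ||x - mu_c||^2 (inverse-square form assumed).
    return max(centroids,
               key=lambda c: masses[c] / (np.sum((x - centroids[c]) ** 2) + eps))

X = np.array([[0.0, 0], [1, 0], [9, 0], [10, 0]])
y = np.array([0, 0, 1, 1])
centroids, masses = train_aac(X, y)
label = predict_gm(np.array([2.0, 0]), centroids, masses)
```

With all masses equal, GM reduces to a plain nearest-centroid classifier; the learned masses are what let it compensate for skewed class distributions.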
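MEB-SLA replaces the arithmetic-average centroid with the center of the class's minimum enclosing ball, which is insensitive to how densely samples cluster inside the class. A simple way to approximate that center is the Bădoiu–Clarkson iteration shown below; the thesis's own MEB algorithm may differ.

```python
import numpy as np

def meb_center(X, iters=200):
    # Approximate minimum-enclosing-ball centre (Badoiu-Clarkson):
    # repeatedly step toward the current farthest point, with step
    # sizes 1/(t+1) that shrink over iterations.
    c = X[0].astype(float)
    for t in range(1, iters + 1):
        far = X[np.argmax(np.linalg.norm(X - c, axis=1))]
        c += (far - c) / (t + 1)
    return c

# Collinear points: the MEB centre is the midpoint of the extremes (5.0),
# whereas the arithmetic mean (~2.9) is dragged toward the dense cluster.
X = np.array([[0.0, 0], [0.5, 0], [1.0, 0], [10.0, 0]])
center = meb_center(X)
```

This contrast is the motivation the abstract gives for MEB-SLA: skewed within-class distributions pull the arithmetic-average centroid away from the geometric center of the class.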
【Degree-granting institution】: University of Electronic Science and Technology of China
【Degree level】: Doctorate
【Year conferred】: 2017
【CLC number】: TP391.1




Article ID: 2045729


Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2045729.html


