A Text Classification Algorithm Based on Feature Library Projection

Published: 2018-10-23 18:44

[Abstract]: Mainstream KNN-based text classification strategies are suited to automatic classification with large sample sizes, but they suffer from relatively high time complexity and from information loss caused by feature dimensionality reduction and sample pruning. This paper proposes a classification algorithm based on feature library projection (FLP). The algorithm first builds a feature library from the features of all training samples under a given weighting strategy, so that the library retains the feature information of every sample. Then, using a projection function, it maps each class's feature library to a projected sample according to the feature set of the sample to be classified, and completes classification by computing the similarity between the new sample and each class's projected sample. The algorithm is validated on the corpus compiled by the Natural Language Processing Group of the International Database Center at Fudan University, tested in two scenarios (a small training set and a large training set), and compared with a clustering-based KNN algorithm. The experimental results show that the FLP algorithm does not lose classification features and achieves high classification accuracy; its classification efficiency is not directly tied to growth in sample size, and its time complexity is low.
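The pipeline described in the abstract (build per-class feature libraries, project each library onto the features of the sample to be classified, then compare by similarity) can be illustrated with a minimal Python sketch. The abstract does not specify the weighting strategy, the projection function, or the similarity measure, so everything concrete here is an assumption made for illustration: raw term frequency as the weight, a projection that simply keeps the library entries for features present in the sample, and cosine similarity. The function names `build_feature_libraries`, `project`, `cosine`, and `classify` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_feature_libraries(training_docs):
    """Accumulate one feature library per class: feature -> summed weight.

    Assumption: the weight is raw term frequency; the paper's actual
    weighting strategy is not given in the abstract.
    """
    libraries = defaultdict(Counter)
    for label, tokens in training_docs:
        libraries[label].update(tokens)
    return libraries


def project(library, sample_features):
    """Assumed projection: keep only library entries for the sample's features."""
    return {f: library[f] for f in sample_features if f in library}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(libraries, tokens):
    """Score the sample against each class's projected sample; return the best."""
    sample = Counter(tokens)
    scores = {
        label: cosine(sample, project(library, sample))
        for label, library in libraries.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data, invented purely for illustration.
training_docs = [
    ("sports", ["match", "team", "goal", "team"]),
    ("sports", ["team", "coach", "goal"]),
    ("finance", ["stock", "market", "bank"]),
    ("finance", ["bank", "loan", "market", "rate"]),
]

libraries = build_feature_libraries(training_docs)
label, scores = classify(libraries, ["team", "goal", "score"])
print(label, scores)
```

In this sketch, classifying a sample touches one library per class and only the features that occur in the sample, so the per-sample cost depends on the number of classes and sample features rather than on the number of training documents. That is consistent with the abstract's claim about efficiency, though the paper's own projection function may differ.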
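[Author affiliations]: Office of Campus Informatization Construction and Management, Hunan University; School of Tourism Management, Hunan University of Commerce; College of Information Engineering and Science, Hunan University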
[Funding]: National Natural Science Foundation of China (61672221, 61304184, 61672156)
[CLC number]: TP391.1
