An Efficient KNN Chinese Text Classification Algorithm Based on the Spark Framework

Published: 2018-05-11 12:28

Topics: K-nearest neighbor + clustering; Reference: 《计算机应用》 (Journal of Computer Applications), 2016, No. 12

[Abstract]: To address the heavy computation of the K-nearest neighbor (KNN) classification algorithm, whose time complexity is proportional to the number of training samples, and the slow processing speed of traditional architectures in the current big-data setting, an efficient KNN classification algorithm based on the Spark framework and clustering optimization is proposed. The algorithm first prunes the training set twice using an optimized K-medoids clustering algorithm that introduces a shrinkage factor. During classification it then iterates over the value of K to obtain the classification result, and it uses the Spark computing framework to partition the data and iterate over the partitions so that the computation runs in parallel. Experimental results show that, on different datasets, the traditional KNN algorithm and the K-medoids-based KNN algorithm take 3.92 to 31.90 times as long as the proposed KNN algorithm under the Spark framework. The proposed algorithm has high computational efficiency, achieves a better speedup ratio than the Hadoop platform, and can effectively classify big data.
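[Author affiliations]: School of Information Science and Engineering, Qufu Normal University; School of Software, Qufu Normal University
[Funding]: National Natural Science Foundation of China (61402258); Shandong Province Undergraduate Teaching Reform Research Project (2015M102); University-level Teaching Reform Research Project (jg05021*)
[Classification number]: TP391.1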
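
The abstract describes two ideas that a small example may make more concrete: shrinking the KNN search space with K-medoids clusters, and computing neighbors partition by partition before merging. The sketch below is a simplified, single-machine illustration in plain Python, not the paper's method. It assumes Euclidean distance on small numeric vectors, uses a plain alternating K-medoids without the paper's shrinkage factor or two-pass pruning, and only emulates Spark partitions with list slices. All function names (`k_medoids`, `prune`, `knn_classify`) and the toy data are hypothetical.

```python
import heapq
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, medoids):
    """Group point indices by their nearest medoid."""
    clusters = [[] for _ in medoids]
    for i, p in enumerate(points):
        nearest = min(range(len(medoids)), key=lambda c: dist(p, medoids[c]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, k, iters=10):
    """Plain alternating K-medoids (assumption: the first k points seed it)."""
    medoids = list(points[:k])
    for _ in range(iters):
        clusters = assign(points, medoids)
        new_medoids = []
        for c, members in enumerate(clusters):
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance
            # to all other members of its cluster.
            best = min(
                members,
                key=lambda i: sum(dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(points[best])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


def prune(query, medoids, clusters, keep=1):
    """Keep only the samples in the `keep` clusters nearest to the query."""
    order = sorted(range(len(medoids)), key=lambda c: dist(query, medoids[c]))
    return [i for c in order[:keep] for i in clusters[c]]


def knn_classify(query, points, labels, candidates, k=3, partitions=2):
    """KNN vote using a local top-k per partition, then a global merge."""
    # Emulated partitions: in Spark these would be RDD partitions.
    chunks = [candidates[p::partitions] for p in range(partitions)]
    local = [
        heapq.nsmallest(k, ((dist(query, points[i]), i) for i in chunk))
        for chunk in chunks
    ]
    top = heapq.nsmallest(k, (pair for part in local for pair in part))
    return Counter(labels[i] for _, i in top).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of 2-D vectors.
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

medoids, clusters = k_medoids(train, k=2)
query = (4.8, 5.0)
candidates = prune(query, medoids, clusters)
print(knn_classify(query, train, labels, candidates))
```

Pruning trades a small risk of missing a true neighbor near a cluster boundary for fewer distance computations; raising `keep` reduces that risk at the cost of a larger candidate set. The per-partition top-k followed by a merge is the usual way to make a nearest-neighbor search parallel, since each partition only has to return k candidates.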

[Similar literature]