Parallel Multi-Label Nearest Neighbor Algorithm on Spark

Published: 2019-05-07 07:22

[Abstract]: With the arrival of the big data era, methods for mining large-scale multi-label data have received wide attention. The multi-label k-nearest-neighbor algorithm ML-KNN is a simple, efficient, and widely used multi-label classification method, and in many applications its classification accuracy is higher than that of other common multi-label learning methods. As the volume of data to be processed keeps growing, however, the traditional serial ML-KNN algorithm can no longer meet the time and storage constraints of big data applications. Drawing on Spark's parallel mechanism and its in-memory iterative computation, this paper proposes SML-KNN, an ML-KNN algorithm built on the Spark parallel framework. In the Map phase, the K nearest neighbors of each sample to be predicted are found within every partition; in the Reduce phase, the final K nearest neighbors are determined from the per-partition neighbor sets; finally, the label sets of these neighbors are aggregated in parallel, and the target label set of the sample is output according to the maximum a posteriori (MAP) criterion. Comparative experiments in serial and parallel environments show that, while preserving classification accuracy, the performance of SML-KNN scales approximately linearly with computing resources, which improves the ability of ML-KNN to handle large-scale multi-label data.
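The Map/Reduce neighbor search and the MAP decision described in the abstract can be sketched roughly as follows. This is a minimal single-process illustration in plain Python, not the authors' Spark implementation: partitions are simulated as lists, and every function name, the toy data, and the likelihood tables are hypothetical. In Spark, the two steps would plausibly map onto `mapPartitions` and `reduce` on an RDD, but that correspondence is an assumption here.

```python
import heapq
from functools import reduce

def local_knn(partition, query, k):
    """Map step (sketch): k nearest samples to `query` inside one partition."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(features, query)), labels)
        for features, labels in partition
    ]
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])

def merge_knn(left, right, k):
    """Reduce step (sketch): keep the k closest of two candidate lists."""
    return heapq.nsmallest(k, list(left) + list(right), key=lambda pair: pair[0])

def neighbor_label_counts(neighbors, num_labels):
    """Aggregate neighbor label sets: how many neighbors carry each label."""
    return [sum(labels[j] for _, labels in neighbors) for j in range(num_labels)]

def map_decision(count, prior, like_pos, like_neg):
    """MAP rule for one label: 1 if P(H1)P(count|H1) > P(H0)P(count|H0)."""
    return int(prior * like_pos[count] > (1 - prior) * like_neg[count])

# Hypothetical toy data: two partitions of (features, binary label vector).
partitions = [
    [((0.0, 0.0), (1, 0)), ((5.0, 5.0), (0, 1))],
    [((1.0, 0.0), (1, 1)), ((9.0, 9.0), (0, 1))],
]
query, k = (0.0, 0.0), 2

candidates = [local_knn(p, query, k) for p in partitions]   # Map
top_k = reduce(lambda a, b: merge_knn(a, b, k), candidates)  # Reduce
counts = neighbor_label_counts(top_k, num_labels=2)

# Hypothetical likelihood tables indexed by neighbor count (0..k); in ML-KNN
# these would be estimated from the training set, which is omitted here.
like_pos = [0.1, 0.3, 0.6]
like_neg = [0.6, 0.3, 0.1]
predicted = [map_decision(c, 0.5, like_pos, like_neg) for c in counts]
print(counts, predicted)
```

Because each partition returns at most K candidates, the merge step only ever compares small lists, which is what keeps the Reduce phase cheap relative to the per-partition distance computation.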
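[Author affiliation]: Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications
[Funding]: Chongqing Basic and Frontier Research Program (csts2014jcyjA40001, cstc2014jcyjA40022); Science and Technology Research Project of the Chongqing Municipal Education Commission (Natural Science) (KJ1400436)
[Classification code]: TP181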
