Research on K-Nearest Neighbors for Big Data Based on YARN and Hashing Techniques

Published: 2018-12-16 21:15

[Abstract]: Big data has been one of the most active research directions in machine learning in recent years, and it poses major challenges for traditional machine learning. K-nearest neighbor (K-NN) is a well-known classification algorithm. Because it is simple and easy to implement, it is widely used in many fields, such as face recognition, gene classification, and decision support. In big data settings, however, the K-NN algorithm becomes very inefficient, or even infeasible. To address this problem, this thesis proposes two solutions based on YARN and hashing: one uses MapReduce and SimHash to implement K-NN classification for large data sets on a cloud computing platform; the other uses Spark and SimHash for the same purpose. The two solutions share the same basic idea, which consists of three steps: (1) apply a hash transformation to the large data set, mapping it into Hamming space; (2) in Hamming space, on the YARN cloud computing platform, use the big data computing frameworks MapReduce and Spark to find the training samples that fall in the same bucket as the test sample x; (3) within that bucket, find the K exact nearest neighbors of x and use them to classify x. Experimental results show that the proposed solutions are feasible and can substantially improve the efficiency of the K-NN algorithm while preserving classification ability.
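The three steps in the abstract can be sketched on a single machine as follows. This is a minimal illustration, not the thesis implementation: it assumes real-valued feature vectors, a random-hyperplane variant of SimHash, and Euclidean distance for the exact search, none of which the abstract specifies. All function names and parameters (`make_hyperplanes`, `simhash`, `build_buckets`, `knn_classify`, `n_bits`, `seed`) are hypothetical. In the distributed setting, the grouping in `build_buckets` is the part that a MapReduce or Spark job would presumably carry out as a group-by-key on the signature.

```python
import random
from collections import Counter, defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes, one per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(x, hyperplanes):
    """Step 1: map a real vector into Hamming space.
    Bit i is 1 when x lies on the non-negative side of hyperplane i."""
    return tuple(
        int(sum(w * v for w, v in zip(h, x)) >= 0.0) for h in hyperplanes
    )


def build_buckets(train, hyperplanes):
    """Step 2: group (vector, label) pairs by signature.
    Samples with identical signatures share a bucket."""
    buckets = defaultdict(list)
    for x, y in train:
        buckets[simhash(x, hyperplanes)].append((x, y))
    return buckets


def squared_distance(a, b):
    """Squared Euclidean distance (assumed metric for the exact search)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_classify(x, buckets, hyperplanes, k):
    """Step 3: exact K-NN restricted to the bucket that x hashes to,
    followed by a majority vote over the neighbors' labels."""
    candidates = buckets.get(simhash(x, hyperplanes), [])
    if not candidates:
        # Assumption: an empty bucket yields no prediction; the abstract
        # does not say how this case is handled.
        return None
    nearest = sorted(candidates, key=lambda p: squared_distance(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy usage: vectors pointing the same way share a signature.
planes = make_hyperplanes(dim=2, n_bits=8)
train = [([1.0, 2.0], "a"), ([2.0, 4.0], "a"), ([-1.0, -2.0], "b")]
buckets = build_buckets(train, planes)
prediction = knn_classify([4.0, 8.0], buckets, planes, k=3)
```

The speedup comes from step 3 scanning only one bucket instead of the whole training set. The trade-off is that true neighbors landing in a different bucket are missed, so the number of signature bits controls a balance between speed and accuracy.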
[Degree-granting institution]: Hebei University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13;TP181
