
Research on K-Nearest Neighbors for Big Data Based on YARN and Hashing Techniques

Published: 2018-12-16 21:15
[Abstract]: Big data has been one of the most active research directions in machine learning in recent years, and it poses serious challenges to traditional machine learning. K-nearest neighbor (KNN) is a well-known classification algorithm. Because it is simple and easy to implement, it is widely used in areas such as face recognition, gene classification, and decision support. In a big-data setting, however, KNN becomes very inefficient, or even infeasible. To address this problem, this thesis proposes two solutions based on YARN and hashing: one uses MapReduce and SimHash to implement KNN classification for large data sets on a cloud computing platform; the other uses Spark and SimHash for the same purpose. The two solutions share the same basic idea, which consists of three steps: (1) apply a hash transformation to the large data set, mapping it into Hamming space; (2) in Hamming space, on the YARN cloud platform, use the big-data computing frameworks MapReduce and Spark to find the training samples that fall in the same bucket as the test sample x; (3) within that bucket, find the K exact nearest neighbors of x and use them to classify x. Experimental results show that the proposed solutions are feasible and can greatly improve the efficiency of KNN while preserving its classification ability.
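The three steps in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: the abstract does not say which SimHash variant is used, so the random-hyperplane form (sign of a dot product with Gaussian vectors) is an assumption here, as are all function names, the bit width, and the choice to return `None` when a bucket is empty. In the thesis, the bucket grouping in step (2) is carried out with MapReduce or Spark on YARN; here a plain dictionary stands in for that stage.

```python
import random
from collections import Counter, defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed SimHash variant)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(x, hyperplanes):
    """Step 1: map a real-valued vector to an integer code in Hamming space."""
    code = 0
    for plane in hyperplanes:
        dot = sum(a * b for a, b in zip(x, plane))
        # One bit per hyperplane: which side of the plane x lies on.
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def build_buckets(train, hyperplanes):
    """Step 2: group training samples by hash code (one bucket per code).

    In a distributed setting this grouping would be a shuffle keyed on the
    code; a dictionary plays that role in this sketch.
    """
    buckets = defaultdict(list)
    for x, label in train:
        buckets[simhash(x, hyperplanes)].append((x, label))
    return buckets


def knn_classify(x, buckets, hyperplanes, k=3):
    """Step 3: exact KNN restricted to the bucket that x hashes into."""
    candidates = buckets.get(simhash(x, hyperplanes), [])
    if not candidates:
        return None  # Assumption: no fallback search in neighbouring buckets.

    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(x, item[0]))

    nearest = sorted(candidates, key=squared_distance)[:k]
    # Majority vote among the k nearest candidates.
    return Counter(label for _, label in nearest).most_common(1)[0][0]


planes = make_hyperplanes(dim=2, n_bits=8)
train = [([1.0, 2.0], "a"), ([-1.0, -2.0], "b")]
buckets = build_buckets(train, planes)
print(knn_classify([2.0, 4.0], buckets, planes, k=1))  # prints "a"
```

The query `[2.0, 4.0]` is a positive multiple of the training point `[1.0, 2.0]`, so every dot product keeps its sign and both vectors receive the same code. This is why the lookup is guaranteed to land in a bucket containing that point. For arbitrary queries the bucket may be empty or may miss true neighbors, which is the accuracy-for-speed trade-off the abstract alludes to.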
[Degree-granting institution]: Hebei University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; TP181


Article ID: 2383061


Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2383061.html

