Analysis and Research of Machine Learning Models Based on Spark
Published: 2018-08-26 18:16
【Abstract】: In an era when distributed computing is the mainstream, the efficiency and performance of applications built on the MapReduce framework are held back by its frequent I/O operations. The RDD-based Spark distributed computing framework can cache data in memory, which is well suited to the iterative nature of machine learning models. To address the problems of machine learning models implemented on MapReduce (chiefly problems inherent in MR itself), this thesis studies machine learning models on Spark, mainly KMeans clustering and ALS collaborative filtering, together with an online machine learning model based on Spark Streaming. The main contributions are as follows: (1) A parallel KMeans clustering model is designed and implemented on the Spark distributed computing framework, and comparative training experiments are carried out with it on MovieLens datasets of different scales. The results show that the parallel KMeans model is well suited to a distributed cluster environment and achieves good parallel computing efficiency; in addition, the repartition operator is used to load data in partitions, optimizing the parallel scheme and effectively reducing model training time. (2) To address the poor real-time responsiveness of the MapReduce framework when processing massive data, an online computing model based on Spark Streaming is designed and implemented for large-scale KMeans clustering analysis. The model divides the whole process into modules such as data ingestion and online training; the modules are connected by data streams to form a task entity, which is submitted to the Spark distributed cluster for execution. Comparative experiments and performance tests verify that the online KMeans clustering model offers high throughput and low latency, and that the cluster remains in good running condition. (3) The ALS (alternating least squares) collaborative filtering recommendation algorithm makes recommendations via matrix factorization: it computes over large volumes of user rating data and stores the large feature matrices produced during computation. Hadoop HA (high availability) is used to solve the NameNode single point of failure in the HDFS distributed file system, and Spark, as a new in-memory distributed big-data computing framework, offers excellent computing performance. This thesis builds an HA Hadoop big-data platform based on QJM (Quorum Journal Manager), studies the ALS collaborative filtering algorithm on the Spark computing framework, and realizes its parallel execution on Spark. Comparative experiments on the Netflix dataset against an ALS implementation based on Hadoop MapReduce show that the Spark-based ALS algorithm achieves markedly better parallel computing efficiency and is better suited to processing massive data.
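The parallel KMeans model in contribution (1) is, at its core, the classic Lloyd iteration: an assignment step (a map over the points) followed by centroid recomputation (a reduce grouped by centroid), which Spark distributes across partitions. The thesis's Spark implementation needs a cluster; the following is only a minimal single-machine Python sketch of that iteration, with illustrative data rather than the MovieLens setup:

```python
import random

def nearest(point, centroids):
    # Assignment step: index of the closest centroid (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Group points by nearest centroid (the reduceByKey analogue in Spark).
        buckets = {i: [] for i in range(k)}
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        # Update step: recompute each centroid as the mean of its bucket.
        for i, bucket in buckets.items():
            if bucket:
                centroids[i] = tuple(sum(dim) / len(bucket) for dim in zip(*bucket))
    return centroids

if __name__ == "__main__":
    pts = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
    # Converges to two centroids near (0.1, 0.05) and (5.1, 5.0).
    print(sorted(kmeans(pts, k=2)))
```

In Spark, the assignment loop becomes a `map` over a partitioned RDD and the bucket means a grouped aggregation, which is why repartitioning the input (as in the thesis's repartition optimization) directly affects how evenly this work is spread.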
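The online model in contribution (2) processes the stream as a sequence of mini-batches, folding each batch into the current centroids. Spark's streaming KMeans uses a decay-weighted update so that old data is gradually forgotten; a single-machine Python sketch of one such batch update (illustrative values, not the thesis's configuration) looks like this:

```python
def nearest(point, centroids):
    # Index of the closest centroid (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def update_model(centroids, weights, batch, decay=0.9):
    """Fold one mini-batch into the model with the decay-weighted rule
    c' = (c * n * a + sum(assigned)) / (n * a + m),  n' = n * a + m,
    where n is the centroid's accumulated weight, a the decay factor,
    and m the number of batch points assigned to that centroid."""
    buckets = [[] for _ in centroids]
    for p in batch:
        buckets[nearest(p, centroids)].append(p)
    new_cs, new_ws = [], []
    for c, w, pts in zip(centroids, weights, buckets):
        n, m = w * decay, len(pts)
        if m == 0:
            # No points this batch: the centroid stays put but its weight decays.
            new_cs.append(c)
            new_ws.append(n)
            continue
        sums = [sum(dim) for dim in zip(*pts)]
        new_cs.append(tuple((ci * n + s) / (n + m) for ci, s in zip(c, sums)))
        new_ws.append(n + m)
    return new_cs, new_ws
```

Calling `update_model` once per arriving batch mirrors how the thesis's data-ingestion and online-training modules hand micro-batches to the clustering step; the decay factor controls the trade-off between responsiveness to new data and stability of the centroids.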
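The key property that makes ALS in contribution (3) parallelize well is that, with one factor matrix fixed, each row of the other factor matrix has an independent closed-form least-squares solution, so rows can be solved in parallel across the cluster. Spark's ALS uses a configurable rank and blocked factor exchange; the following pure-Python sketch strips this down to rank 1 (scalar factors) on a tiny illustrative ratings dict, not the Netflix data:

```python
def als_rank1(ratings, n_users, n_items, iters=20, lam=0.01):
    """Rank-1 alternating least squares on a sparse ratings dict
    {(user, item): rating}. Each half-step is a closed-form regularized
    least-squares solve per user (or per item) -- the unit of work that
    a Spark implementation distributes."""
    u = [1.0] * n_users  # user factors
    v = [1.0] * n_items  # item factors
    for _ in range(iters):
        # Fix item factors, solve each user factor in closed form.
        for i in range(n_users):
            rated = [(j, r) for (ui, j), r in ratings.items() if ui == i]
            if rated:
                u[i] = (sum(r * v[j] for j, r in rated)
                        / (lam + sum(v[j] ** 2 for j, _ in rated)))
        # Fix user factors, solve each item factor symmetrically.
        for j in range(n_items):
            rated = [(i, r) for (i, jj), r in ratings.items() if jj == j]
            if rated:
                v[j] = (sum(r * u[i] for i, r in rated)
                        / (lam + sum(u[i] ** 2 for i, _ in rated)))
    return u, v
```

After training, the predicted rating for (user i, item j) is simply `u[i] * v[j]`; with rank k it becomes a dot product of two length-k factor vectors, and the per-row solves become small k-by-k normal-equation systems.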
【Degree-granting institution】: Kunming University of Science and Technology
【Degree level】: Master
【Year conferred】: 2017
【CLC number】: TP311.13; TP181
Document ID: 2205752
Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2205752.html