Research on Spark-Based Distributed Collaborative Filtering and Tools

Published: 2018-07-02 21:29

Topic: Spark + collaborative filtering; Source: Nanjing University, master's thesis, 2017

[Abstract]: With the rapid development of mobile Internet and Internet of Things technologies, the volume of data being collected is growing exponentially. Distributed computing has become an indispensable technology for big data processing and analysis. By decomposing a computing task into subproblems that can execute concurrently and running them simultaneously on multiple interconnected compute nodes, distributed computing overcomes the single-machine performance bottleneck and poor scalability that traditional algorithms face. Distributed machine learning algorithms have likewise become an active research topic in industry. Among the many distributed computing frameworks, Spark is widely used for its high fault tolerance, high scalability, and ease of use. However, there is still no unified framework for analyzing and comparing the complexity of distributed algorithms implemented on it. As a result, the scalability and performance of a given algorithm on Spark cannot be analyzed and compared theoretically, only empirically.

Based on a study of the Spark distributed platform, this thesis proposes a complexity analysis framework for distributed algorithms on Spark and takes Spark-based collaborative filtering algorithms as the application scenario, demonstrating that the framework can effectively guide algorithm development and runtime environment configuration. Specifically, the thesis does the following work.

First, it introduces distributed computing and collaborative filtering. The distributed computing part analyzes the computation models, execution models, and design philosophies of the popular Hadoop and Spark platforms and explains their underlying principles. The collaborative filtering part analyzes memory-based collaborative filtering and matrix-factorization-based collaborative filtering and introduces several classical algorithms.

Then, it proposes a complexity analysis framework for distributed algorithms on Spark and, on that basis, carries out complexity analysis and experimental analysis of several Spark-based distributed collaborative filtering algorithms, including three parallelization methods for memory-based collaborative filtering and three parallelization methods for matrix-factorization-based collaborative filtering.

Finally, it designs a Spark-based data mining toolbox. By packaging data mining algorithms as components, the toolbox provides a configuration-based development model for data analysis applications, which addresses the difficulty analysts have in using Spark. With the toolbox, users can apply a variety of distributed data mining algorithms to massive data sets without any programming. The thesis describes the toolbox's functions and its design and development process in detail.
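As an illustration of the ideas the abstract combines, the following is a minimal sketch of memory-based (item-based) collaborative filtering arranged as a map step and a reduce step. It is not the thesis's method or one of its three parallelization schemes, and it uses plain Python rather than Spark: the partitions are ordinary lists, and every name and the toy ratings are assumptions made for the example. Each partition produces partial sums independently, which is the part that could run concurrently on separate nodes, and the reduce step merges them into cosine similarities.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy ratings as (user, item, rating); not data from the thesis.
# Ratings are assumed to be positive so that no item has a zero norm.
RATINGS = [
    ("u1", "A", 5.0), ("u1", "B", 3.0),
    ("u2", "A", 4.0), ("u2", "B", 4.0), ("u2", "C", 1.0),
    ("u3", "B", 2.0), ("u3", "C", 5.0),
]


def partition_by_user(ratings, num_partitions):
    """Keep all ratings of one user in the same partition (stand-in for a shuffle)."""
    users = sorted({user for user, _, _ in ratings})
    slot = {user: idx % num_partitions for idx, user in enumerate(users)}
    parts = [[] for _ in range(num_partitions)]
    for user, item, rating in ratings:
        parts[slot[user]].append((user, item, rating))
    return parts


def map_partition(part):
    """Map step: partial dot products and squared norms from one partition only."""
    by_user = defaultdict(dict)
    for user, item, rating in part:
        by_user[user][item] = rating
    dots, norms = defaultdict(float), defaultdict(float)
    for items in by_user.values():
        for item, rating in items.items():
            norms[item] += rating * rating
        # Every pair of items rated by the same user contributes to a dot product.
        for i, j in combinations(sorted(items), 2):
            dots[(i, j)] += items[i] * items[j]
    return dots, norms


def reduce_partials(partials):
    """Reduce step: add up the partial sums, then form cosine similarities."""
    dots, norms = defaultdict(float), defaultdict(float)
    for part_dots, part_norms in partials:
        for key, value in part_dots.items():
            dots[key] += value
        for key, value in part_norms.items():
            norms[key] += value
    return {
        (i, j): value / (sqrt(norms[i]) * sqrt(norms[j]))
        for (i, j), value in dots.items()
    }


def item_similarities(ratings, num_partitions):
    """Item-item cosine similarities computed from per-partition partial sums."""
    parts = partition_by_user(ratings, num_partitions)
    return reduce_partials(map_partition(part) for part in parts)


if __name__ == "__main__":
    for pair, sim in sorted(item_similarities(RATINGS, 2).items()):
        print(pair, round(sim, 4))
```

Because the partial sums are additive, the result does not depend on how many partitions are used. In this sketch, the number of item pairs generated per user is the kind of cost term a complexity analysis of such an algorithm would have to account for.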
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.3
