Design and Implementation of an Intelligent Recommendation Logistics System Based on the Hadoop Cloud Platform

Published: 2018-05-15 22:10

Topic: recommender systems + Hadoop; Source: Shenyang Normal University, 2015 master's thesis

[Abstract]: Hadoop, an open-source cloud platform framework, has kept maturing alongside the rapid growth of the Internet, releasing faster and more stable versions, and it has drawn increasingly wide attention in today's data-driven environment. It provides an open-source implementation of MapReduce, Google's distributed parallel programming model, offers a rich set of service interfaces, and can be deployed on clusters of thousands of nodes to handle massive data-computing workloads. To parallelize an algorithm with the MapReduce programming model, developers need only concentrate on the computation they want to solve and submit their custom MapReduce classes to the corresponding platform interface, which greatly simplifies the development and study of cloud computing services and big-data processing.

The research in this thesis is built on the Hadoop cloud platform. Four worker nodes were set up on VMware-virtualized servers, and the application of intelligent recommendation algorithms was studied on this small cluster. The thesis examines in detail the deployment and configuration of Hadoop and the programming methods for distributed parallel computation based on the MapReduce model. It studies the existing customer-relationship information of a logistics business platform and builds a recommendation system framework on Hadoop. Working with offline experiments, it obtains the raw data from the business platform's Oracle database and transforms it with a simple ETL module so that the data is better suited to MapReduce algorithms. The recommendation method is item-based collaborative filtering. Its core idea is to construct an item-item co-occurrence matrix from the user-item rating matrix and then use that matrix to quickly compute the items a user is likely to be interested in. The basic implementation is relatively simple and is efficient on data sets of a certain scale.

The algorithm was implemented with the MapReduce programming model and integrated with the logistics business platform to provide recommendation services to enterprise users in the logistics industry. Because Hadoop splits the data set, this implementation produced localized recommendation results. To solve this problem, and after analyzing the data growth of the existing platform and the overall system structure, the thesis proposes using Redis to build a cache data layer for the recommendation system that stores the co-occurrence matrix used by the algorithm, and it adjusts the program flow of the original implementation accordingly. The two methods are analyzed and compared on several evaluation metrics. The experimental results show that the adjusted program, which caches the co-occurrence matrix in Redis, improves clearly in performance and on the evaluation metrics, runs in reasonable time, and achieves good recommendation quality, while retaining good real-time behavior and scalability as the data set grows.
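The co-occurrence approach described in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's MapReduce code: the rating data, the function names (`cooccurrence`, `recommend`), and the scoring rule (co-occurrence count times the user's rating) are assumptions made for illustration. The last lines show one plausible reading of the localization problem, in which counts built from a single data split are incomplete until the partial counts from all splits are summed; a plain `Counter` stands in here for the shared Redis cache.

```python
from collections import Counter
from itertools import permutations

# Hypothetical user -> {item: rating} data; not taken from the thesis.
RATINGS = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"B": 4.0, "C": 3.0},
}

def cooccurrence(user_items):
    """Count, for each ordered item pair (a, b), the users who rated both."""
    counts = Counter()
    for ratings in user_items.values():
        counts.update(permutations(sorted(ratings), 2))
    return counts

def recommend(user, user_items, counts, top_n=3):
    """Score each unrated item by summing count(a, b) * rating(a)."""
    rated = user_items[user]
    scores = Counter()
    for (a, b), n in counts.items():
        if a in rated and b not in rated:
            scores[b] += n * rated[a]
    # Highest score first; ties broken by item name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Global matrix built from all users at once.
global_counts = cooccurrence(RATINGS)

# Two hypothetical input splits, as a distributed job might see them.
split_1 = {u: RATINGS[u] for u in ("u1", "u2")}
split_2 = {u: RATINGS[u] for u in ("u3",)}
partial_1 = cooccurrence(split_1)
partial_2 = cooccurrence(split_2)

# Summing the partial counts recovers the global matrix.
merged_counts = partial_1 + partial_2

local_result = recommend("u1", RATINGS, partial_1)       # one split only
global_result = recommend("u1", RATINGS, merged_counts)  # merged counts
```

In this toy data, scoring `u1` against `partial_1` alone underweights item `C`, because the co-occurrence of `B` and `C` contributed by `u3` lives in the other split. Keeping the merged matrix in one shared store is the role the abstract assigns to Redis.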

[Degree-granting institution]: Shenyang Normal University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3
