Research and Implementation of a Recommender System Based on Mahout and Hadoop

Published: 2018-03-11 05:05

Topic: recommender systems; Focus: collaborative filtering; Source: Yangtze University, 2016 master's thesis; Document type: degree thesis

[Abstract]: With the rapid growth of the Internet in recent years, led by e-commerce, the volume of data and information has exploded, making it harder to pick out, from an enormous catalogue of products, the ones a target user actually needs. Recommender systems play an increasingly important role in meeting this need, so studying them closely is of real practical significance. Improving recommendation accuracy can bring substantial economic benefit to the enterprises that deploy such systems, and it gives their users a more personalized and convenient service. Collaborative filtering has many successful applications in recommender systems, but it performs poorly on sparse data. Starting from the basic concepts of recommendation algorithms, this thesis discusses collaborative filtering algorithms built on several different similarity measures, proposes a similarity measure based on the Bhattacharyya coefficient, and verifies its effectiveness experimentally on the publicly available MovieLens, Netflix, and Yahoo Music datasets.
Because a recommender system is data-intensive, its data can easily grow explosively. For this massive-data setting, the thesis analyzes the computing principles of the Hadoop distributed computing platform, describes in detail the recommendation components of the well-known machine learning framework Mahout, and explains how Mahout eases the implementation of the proposed Bhattacharyya-coefficient-based collaborative filtering algorithm and how it can be combined with Hadoop. Finally, a system prototype is designed and implemented. The thesis details how the proposed similarity measure is implemented in Mahout and gives the source code. Then, because a long-running system will inevitably need it, the thesis provides a concrete plan and steps for migrating the system from a single-machine environment to the Hadoop distributed computing platform, using Mahout together with Hadoop to overcome the computing and storage bottlenecks caused by massive data.
In summary, the thesis makes two main contributions: 1) Collaborative filtering relies too heavily on co-rated data, so its recommendations are inaccurate when data is sparse. To address this, the thesis proposes a new similarity measure based on the Bhattacharyya coefficient for use in collaborative filtering, and experiments on public datasets demonstrate its effectiveness in sparse scenarios. 2) For practical use, the Mahout library is extended with the Bhattacharyya-coefficient-based collaborative filtering algorithm studied here, and the source code of the key parts is given.
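
The abstract does not state the thesis's exact formula, so the following is only a minimal sketch of the standard Bhattacharyya coefficient, BC = sum over rating values h of sqrt(p_a(h) * p_b(h)), applied to the rating distributions of two items. The function names, the 1-5 rating scale, and the sample ratings are all assumptions made for illustration; they are not taken from the thesis or from Mahout. The point it illustrates is that this quantity is computed from each item's overall rating histogram, so it does not require the two items to have been rated by the same users.

```python
from collections import Counter
from math import sqrt

# Assumed 1-5 star scale (illustrative; not specified in the abstract).
RATING_SCALE = (1, 2, 3, 4, 5)


def rating_distribution(ratings, scale=RATING_SCALE):
    """Return the empirical probability of each rating value on the scale."""
    if not ratings:
        raise ValueError("an item needs at least one rating")
    counts = Counter(ratings)
    total = len(ratings)
    return [counts[value] / total for value in scale]


def bhattacharyya_coefficient(ratings_a, ratings_b, scale=RATING_SCALE):
    """Bhattacharyya coefficient between two items' rating distributions.

    The value lies in [0, 1]: 1 for identical distributions, 0 when the
    two items share no rating value at all.
    """
    p_a = rating_distribution(ratings_a, scale)
    p_b = rating_distribution(ratings_b, scale)
    return sum(sqrt(pa * pb) for pa, pb in zip(p_a, p_b))


# Hypothetical ratings; the raters of each item need not overlap.
item_x = [5, 5, 4, 4]
item_y = [5, 4, 4, 5, 5]
item_z = [1, 1, 2]

print(bhattacharyya_coefficient(item_x, item_y))
print(bhattacharyya_coefficient(item_x, item_z))
```

In a collaborative filtering setting, a value like this would typically be combined with a local similarity over the ratings themselves; how the thesis does so is not described in the abstract.
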
[Degree-granting institution]: Yangtze University
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP391.3

Article ID: 1596692

Link to this article: https://www.wllwen.com/jingjilunwen/dianzishangwulunwen/1596692.html
