Research on a Spark-Based Recommender System

Published: 2018-04-20 15:10

Topic: recommender systems + collaborative filtering algorithms; Source: Zhejiang Sci-Tech University, 2017 master's thesis

【Abstract】: With the rapid development of the Internet and information technology, massive amounts of data are being produced, and extracting valuable data from this mass of information is a pressing problem. Recommender systems are one effective answer: they recommend products to a target user based on users' historical behavior and preferences, and are widely used in e-commerce, video and music portals, and many other fields. They still suffer, however, from data sparsity, cold start, and unsatisfactory prediction accuracy. In particular, as the numbers of users and items keep growing, traditional single-machine recommendation algorithms hit a scalability bottleneck and struggle to meet current commercial demands; parallel implementations on distributed computing platforms offer a new way to address this. Spark is a new in-memory, general-purpose parallel computing engine for big data, and its strength in iterative parallel computation has drawn wide attention in big data processing. This thesis studies neighborhood-based and model-based recommendation algorithms, improves them to address sparsity, cold start, and unsatisfactory prediction accuracy, and designs and implements the improved algorithms in parallel on a Spark cluster. The specific contributions are as follows:

(1) For user-based collaborative filtering, where prediction accuracy suffers when rating data are sparse, user attribute feature similarity is introduced. When computing user similarity, the thesis combines user attribute feature similarity with user collaborative filtering similarity to mitigate the effect of rating sparsity on the similarity calculation. The optimized algorithm is implemented on the Spark platform; experimental results show that it improves recommendation prediction accuracy and also improves execution efficiency.

(2) For item-based collaborative filtering, where prediction accuracy suffers under cold start, item attribute feature similarity is introduced. When computing item similarity, the thesis combines item attribute feature similarity with rating-data similarity to reduce the negative effect of cold start on the similarity calculation. The optimized algorithm is designed and implemented in parallel on the Spark platform; experimental results show that it improves the system's prediction accuracy.

(3) For the recommendation algorithm based on the ALS (alternating least squares) model, the thesis designs a new objective function that incorporates user and item similarity information obtained before model training. The ALS-based algorithm is designed and implemented in parallel on the Spark platform; experimental results show that, under the new objective function, prediction accuracy is better and execution efficiency is also improved.
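
The idea in contribution (1), combining an attribute-based similarity with a rating-based similarity, can be illustrated with a minimal single-machine sketch. The abstract does not state how the two similarities are combined, so the linear blend with weight `alpha` below is an assumption, as are the function names, the cosine and attribute-matching measures, and the sample data. It is not the thesis's actual method and does not use Spark.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' ratings (dicts: item -> rating).

    The dot product runs over co-rated items; the norms run over all items
    each user rated, so users with little overlap score low.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute_similarity(attrs_a, attrs_b):
    """Fraction of attributes on which two users agree (hypothetical measure)."""
    keys = set(attrs_a) | set(attrs_b)
    if not keys:
        return 0.0
    matches = sum(
        1 for k in keys
        if k in attrs_a and k in attrs_b and attrs_a[k] == attrs_b[k]
    )
    return matches / len(keys)


def hybrid_similarity(ratings_a, ratings_b, attrs_a, attrs_b, alpha=0.5):
    """Assumed linear blend: alpha weights attributes, (1 - alpha) weights ratings."""
    return (alpha * attribute_similarity(attrs_a, attrs_b)
            + (1.0 - alpha) * cosine_similarity(ratings_a, ratings_b))


# Hypothetical users with no co-rated items: rating similarity alone is 0,
# but shared attributes still yield a usable similarity score.
ratings_u = {"m1": 5.0, "m2": 3.0}
ratings_v = {"m3": 4.0}
attrs_u = {"age_group": "25-34", "occupation": "engineer"}
attrs_v = {"age_group": "25-34", "occupation": "teacher"}

score = hybrid_similarity(ratings_u, ratings_v, attrs_u, attrs_v, alpha=0.5)
print(score)
```

With sparse ratings the rating term contributes little or nothing, so the attribute term keeps the neighbor search from degenerating. The same blend applied to items instead of users would correspond to the cold-start case in contribution (2), again under the assumption of a linear combination.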
【Degree-granting institution】: Zhejiang Sci-Tech University
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP391.3

【References】