A Hybrid Recommendation Algorithm Based on the MapReduce Framework

Published: 2018-05-30 03:22

Topics: collaborative filtering + hybrid recommender systems; Source: Changchun University of Technology, master's thesis, 2017

[Abstract]: The explosive growth of Internet information, the increasing variety of that information, and the emergence of new e-commerce services have made information overload more and more severe. As a result, recommender systems have become increasingly important among information filtering tools. In deployed systems, collaborative filtering is the most widely used personalized recommendation method. As recommender systems grow in scale, however, most traditional recommendation algorithms run into serious computational bottlenecks, and the larger volume of data does not significantly improve recommendation accuracy. Parallelizing collaborative filtering is therefore necessary to cope with ever-growing data volumes. This thesis studies the design and implementation of collaborative filtering recommendation algorithms on the MapReduce parallel computing framework.

The algorithms are first parallelized with MapReduce and then optimized individually. For item-based collaborative filtering, a co-occurrence matrix replaces the similarity matrix, which reduces the time spent computing the similarity matrix; when recommendations are computed, a Top-N method selects the nearest neighbors, which reduces the amount of computation. For user-based collaborative filtering, the data are grouped by clustering. Within each group, users in the same group serve as nearest neighbors to compute an in-group recommendation value; all center users serve as neighbors to compute a between-group recommendation value. The three recommendation results produced by these two algorithms are used as training inputs, with the actual ratings as the output, and a linear regression model is fitted. After a loss function is defined for this model, gradient descent is used to find the optimal mixing ratio. Specifically, the data are split by ten-fold cross-validation into multiple data groups; different Top-N values and data groups yield different mixing parameters, and each set of parameters is then applied to all data groups to compute the mean MAE and RMSE. Comparing these means gives the optimal mixing coefficients and Top-N value.

In the experiments, the three recommendation results produced by the two algorithms are blended to produce the final recommendations, and the accuracy of those recommendations is verified. The performance of the improved algorithm is also evaluated in terms of program running time. The results show that the modified collaborative filtering algorithm not only handles large-scale data better but also achieves higher accuracy by blending different results. Compared with item-based collaborative filtering, accuracy improves markedly and running time drops markedly. Compared with user-based collaborative filtering, accuracy also improves markedly, and grouping reduces the time spent computing the similarity matrix and the recommendation results, so efficiency improves markedly.
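The blending step described in the abstract (linear regression over three recommendation values, fitted by gradient descent and scored with MAE and RMSE) can be illustrated with a minimal single-machine sketch. This is not the thesis's implementation: the squared-error loss, the absence of a bias term, the learning rate, the function names, and the toy numbers are all assumptions made for illustration, and the ten-fold cross-validation and MapReduce parallelization are omitted.

```python
import math


def blend(weights, preds):
    """Weighted sum of the component predictions for one rating."""
    return sum(w * p for w, p in zip(weights, preds))


def fit_blend_weights(samples, targets, lr=0.01, epochs=2000):
    """Fit mixing weights by batch gradient descent on mean squared error.

    samples: list of (item_based, in_group, between_group) predictions.
    targets: the actual ratings for the same user-item pairs.
    lr and epochs are illustrative values, not taken from the thesis.
    """
    n = len(samples)
    k = len(samples[0])
    weights = [1.0 / k] * k  # start from an equal mix
    for _ in range(epochs):
        grad = [0.0] * k
        for preds, y in zip(samples, targets):
            err = blend(weights, preds) - y
            for j in range(k):
                grad[j] += 2.0 * err * preds[j] / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def mae(weights, samples, targets):
    """Mean absolute error of the blended predictions."""
    total = sum(abs(blend(weights, p) - y) for p, y in zip(samples, targets))
    return total / len(targets)


def rmse(weights, samples, targets):
    """Root mean squared error of the blended predictions."""
    total = sum((blend(weights, p) - y) ** 2 for p, y in zip(samples, targets))
    return math.sqrt(total / len(targets))


# Toy data invented for this sketch (not results from the thesis):
# each row holds the three component predictions for one rating.
samples = [
    (4.0, 3.5, 3.0),
    (2.0, 2.5, 3.0),
    (5.0, 4.5, 4.0),
    (1.0, 2.0, 1.5),
    (3.0, 3.0, 3.5),
    (4.5, 4.0, 5.0),
]
targets = [4.0, 2.0, 5.0, 1.0, 3.0, 5.0]

weights = fit_blend_weights(samples, targets)
print("weights:", weights)
print("MAE:", mae(weights, samples, targets))
print("RMSE:", rmse(weights, samples, targets))
```

In the procedure the abstract describes, this fit would be repeated for each Top-N value and each cross-validation split, and the configuration with the lowest mean MAE and RMSE would be kept.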
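[Degree-granting institution]: Changchun University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3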
