Research on Spark-Based Book Recommendation Technology on a Campus Resource Cloud

Published: 2018-10-31 17:04

【Abstract】: As informatization in colleges and universities advances, campus cloud platforms have become a focus of attention. A campus resource cloud platform meets and safeguards a school's needs in many areas and provides an efficient, reliable computing and storage platform for campus big-data analysis. This research relies on such a platform and thereby gains the support of a robust information infrastructure. Meanwhile, the widespread use of business management information systems causes data to accumulate continuously. The library management system in particular has accumulated a large volume of historical book circulation data, which keeps growing over time and conceals a great deal of valuable information. To make fuller use of the library's circulation data and improve the digital experience of teachers and students, this thesis analyzes that data in depth so that teachers and students can receive personalized book recommendations.

The thesis first designs the computing resources, storage resources, and platform functions of the campus resource cloud platform. It then uses the cloud platform as the test and runtime environment for book recommendation, building a Spark cluster on it with HDFS as the storage system and Spark as the computing platform. To address missing data and data-format problems, the raw data are preprocessed and a user-book rating matrix is constructed. To address data sparsity, a collaborative filtering algorithm based on ALS matrix factorization is adopted; the K-Means clustering algorithm is then integrated into the ALS matrix factorization algorithm to address the user cold-start problem. To address the attribute-weighting and initial-value problems of K-Means, the algorithm is optimized using weighted Euclidean distance and the max-min algorithm. Finally, the algorithms are implemented on Spark and experiments are designed to validate them, producing personalized book recommendations for different users.

The experiments determine the optimal parameters of the ALS matrix factorization algorithm and demonstrate that the proposed hybrid recommendation algorithm can address the data sparsity and cold-start problems, that the K-Means optimization improves clustering quality, and that integrating the clustering algorithm improves prediction accuracy and computation speed. Finally, the parallel speedup measured on the Spark platform confirms the advantage of the Spark cluster.
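
The abstract names two K-Means refinements, a weighted Euclidean distance and max-min (maximum-minimum) selection of initial centers, without giving their details. The following is a minimal single-machine sketch of one common reading of those ideas, not the thesis's Spark implementation. The function names, the sample points, the attribute weights, and the choice of the first point as the first center are all assumptions made for illustration.

```python
import math

def weighted_euclidean(a, b, weights):
    """Euclidean distance with one non-negative weight per attribute."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def max_min_init(points, k, weights):
    """Max-min seeding (assumed variant): start from the first point, then
    repeatedly add the point whose nearest chosen center is farthest away."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(weighted_euclidean(p, c, weights) for c in centers),
        )
        centers.append(farthest)
    return centers

def kmeans(points, k, weights, max_iter=100):
    """K-Means using the weighted distance and max-min initial centers."""
    centers = max_min_init(points, k, weights)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: weighted_euclidean(p, centers[j], weights))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Hypothetical two-attribute user profiles and attribute weights.
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9), (1.0, 9.0), (0.8, 9.2)]
weights = (2.0, 1.0)
centers, labels = kmeans(points, 3, weights)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

Seeding with mutually distant points avoids the dependence on random initial centers, and the weights let more informative attributes count for more in the distance. This sketch assumes `k` does not exceed the number of distinct points.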
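【Degree-granting institution】: Xi'an University of Science and Technology
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: TP391.3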
