Design and Implementation of a Spark-Based College Entrance Examination Recommendation System

Published: 2018-03-21 09:05

Topic: big data　Focus: recommender systems　Source: Shandong Normal University, 2017 master's thesis　Document type: degree thesis

[Abstract]: Recommender systems were proposed to resolve a two-sided problem: users cannot find valuable information, and information cannot reach the users who need it. With the arrival of the big data era, recommender systems themselves struggle to process massive data volumes, and combining them with big data processing technology is the natural way forward. Spark, a leading big data processing technology, introduced the RDD data model and an in-memory computing model, and is now widely used in e-commerce, video, social networking, and other fields. In education, however, both recommender systems and big data processing technology have seen little use. The college entrance examination (gaokao) is a major event in education, and choosing which institutions to apply to is a focus of candidates' attention. Admission records from previous years are an important reference for candidates making these application choices, but because the data are large and complex, they are rarely exploited. This thesis combines a recommender system with the Spark big data processing framework and applies them to education, a field where both are seldom used, to help candidates choose which institutions to apply to. The work completed is as follows. (1) Using HTML, CSS (Cascading Style Sheets), and JSP as front-end technologies, a Web front end for application recommendation was designed and developed. It includes a user registration page, a user login page, a page displaying recommendation results, and pages for browsing related examination information (policies, news, university information, and major information). The front end aims to give users a good interactive experience while keeping the system practical and easy to use. (2) With the Web front end as the producer of user logs, a well-performing log collection module was designed. First, the Flume log collection tool gathers the log information. Second, a Sink component passes the collected information to the Kafka message middleware, which distributes it uniformly. Finally, the Spark Streaming stream processing framework cleans the log information held in Kafka, extracts the relevant fields, and stores the result in the HDFS file system. (3) A recommendation engine was designed for the application-choice scenario. First, based on a broad reading of the literature on college application choices, suitable user attributes were selected, similarities were computed, a similarity matrix was built, and similar users were identified. Second, several of the most common recommendation algorithms were analyzed, and user-based collaborative filtering was chosen as the system's recommendation algorithm because it fits the real application scenario. Finally, the recommendation list is generated using the parallel computation of the Spark framework. (4) A distributed Spark cluster development environment was set up, and the whole system was developed and tested. First, following the relevant documentation, a three-node distributed Spark cluster was built in the laboratory. Second, the system was implemented in Scala. Finally, after development was complete, the log collection tool and the Spark components were performance-tested to confirm that the system runs correctly and efficiently, and the accuracy of the recommendations and overall user satisfaction with the system were also tested to ensure a good user experience.
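The user-based collaborative filtering step described in item (3) can be illustrated with a small single-machine sketch. This is not the thesis's implementation, which was written in Scala and parallelized on Spark. The attribute vectors (a normalized score, a normalized rank, and a subject-track flag), the use of cosine similarity, the school names, and the function names `cosine` and `recommend` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(target, history, k=2, n=3):
    """Return up to n schools for a candidate with attribute vector `target`.

    history maps a past candidate id to (attribute vector, set of admitted schools).
    The k most similar past candidates vote for their schools, weighted by similarity.
    """
    neighbours = sorted(
        ((cosine(target, attrs), uid) for uid, (attrs, _) in history.items()),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, uid in neighbours:
        for school in history[uid][1]:
            scores[school] += sim
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [school for school, _ in ranked[:n]]


# Hypothetical data: (normalized score, normalized rank, science-track flag).
history = {
    "u1": ((0.90, 0.85, 1.0), {"School A", "School B"}),
    "u2": ((0.88, 0.80, 1.0), {"School B", "School C"}),
    "u3": ((0.40, 0.30, 0.0), {"School D"}),
}
target = (0.89, 0.84, 1.0)

print(recommend(target, history, k=2, n=3))
```

In this sketch a school admitted by several similar past candidates accumulates a higher score than one admitted by a single neighbour. On a cluster, the pairwise similarity computation is the part that would most naturally be distributed.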
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.3
