Personalized Book Recommendation Based on Text Semantics

Published: 2018-07-28 17:07

[Abstract]: The vast number of book tags and summaries accumulated on the Internet provides a new data source for analyzing reading interests and building personalized book recommendation systems. This thesis therefore studies how to integrate tags, summaries, and other text data to build a personalized book recommendation system and improve its performance. The work has three parts: a semantics-based interest preference model, the design of the recommendation algorithm, and a parallel implementation on the Spark platform. First, an algorithm is proposed that computes the semantic similarity of tags from word vectors and co-occurrence frequency, with optimizations designed for the specific scenario. Then the PIC algorithm and the LDA algorithm are used to build semantic preference models based on tags and on book summaries respectively, and an extended collaborative filtering algorithm based on semantic preference generates the book recommendation list. Finally, the recommendation system is implemented in parallel on the Spark distributed computing platform.

The thesis first introduces the research background and significance of the topic, summarizes from the related literature the key issues that affect the performance of personalized recommendation systems, and defines the specific research content. Second, it studies the key technologies involved, including semantic analysis, clustering, and recommendation algorithms, and points out the strengths and weaknesses of each; these form the theoretical basis for the subsequent work. Third, it builds an interest preference model based on text semantics. A decay function is introduced as a weight to handle the time effect of tag preferences. An algorithm that computes tag similarity from word vectors and co-occurrence frequency is proposed and optimized for the specific scenario of this work, improving the accuracy of the relatedness calculation. Tag clustering is implemented with the PIC algorithm to build an interest preference model based on tag semantics, which addresses tag sparsity. The LDA algorithm is used to analyze the latent topic distribution of book summaries and build a summary-based semantic preference model, which addresses the cold-start problem caused by too few tags. The extended collaborative filtering algorithm based on semantic preference generates the recommendation results, and experiments are designed to test system performance. The experimental results show that (1) reading interest preference features based on text semantics correctly reflect users' interest preferences; and (2) the recommendation algorithm performs well on metrics such as accuracy and diversity.

Finally, a recommendation system based on the Spark distributed computing platform is designed and implemented. The main modules are word vector training, LDA topic analysis, tag clustering, and the extended collaborative filtering algorithm. The first three are implemented on interfaces provided by MLlib, Spark's machine learning library. The extended collaborative filtering algorithm has two modes, item-based and user-based, and the thesis designs implementation flows for the specific modules. Measurements show that all of the algorithms achieve significant speedup.
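As a rough illustration of two ideas named in the abstract, the sketch below weights a user's tags by a time-decay function and blends a word-vector similarity with a co-occurrence similarity. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the exponential half-life decay, the Jaccard-style co-occurrence score, the linear blend with weight `alpha`, and all function and parameter names. It uses plain Python rather than Spark so that it stays self-contained.

```python
import math
from collections import Counter
from itertools import combinations


def decayed_tag_weights(events, now, half_life_days=180.0):
    """Turn (tag, day) tagging events into normalized preference weights.

    Assumption: exponential decay with a half-life; the thesis only says
    that a decay function is used as the weight.
    """
    weights = Counter()
    for tag, day in events:
        age = max(0.0, now - day)
        weights[tag] += 0.5 ** (age / half_life_days)
    total = sum(weights.values())
    return {tag: w / total for tag, w in weights.items()} if total else {}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cooccurrence_similarity(tag_sets):
    """Build a Jaccard-style similarity from per-book tag lists.

    Assumption: co(a, b) / (n(a) + n(b) - co(a, b)), where n counts books
    carrying a tag and co counts books carrying both tags.
    """
    count = Counter()
    pair = Counter()
    for tags in tag_sets:
        unique = sorted(set(tags))
        count.update(unique)
        pair.update(combinations(unique, 2))

    def sim(a, b):
        if a == b:
            return 1.0 if count[a] else 0.0
        co = pair[tuple(sorted((a, b)))]
        denom = count[a] + count[b] - co
        return co / denom if denom else 0.0

    return sim


def blended_similarity(a, b, vectors, co_sim, alpha=0.5):
    """Linear blend of word-vector cosine and co-occurrence similarity.

    Assumption: alpha is a hypothetical mixing weight, not from the thesis.
    """
    return alpha * cosine(vectors[a], vectors[b]) + (1 - alpha) * co_sim(a, b)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    events = [("scifi", 100), ("history", 0)]
    print(decayed_tag_weights(events, now=100, half_life_days=100))

    books = [["scifi", "space"], ["scifi", "space"], ["scifi", "history"]]
    co_sim = cooccurrence_similarity(books)
    vectors = {"scifi": [1.0, 0.0], "space": [1.0, 0.0], "history": [0.0, 1.0]}
    print(blended_similarity("scifi", "space", vectors, co_sim))
```

A pairwise similarity of this kind could serve as the edge weight of the affinity graph that a clustering step such as PIC consumes, which is how the abstract orders these stages.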
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3
