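Research and Implementation of News Grouping Clustering Techniques in a Personalized News Recommendation Engine
Published: 2018-02-27 12:04
Keywords: recommendation; engine; text feature extraction; text clustering; LSH. Source: Beijing University of Posts and Telecommunications, master's thesis, 2013. Document type: degree thesis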
[Abstract]: With the rapid growth of the Internet, people face a massive volume of information every day, and how to quickly extract valuable information from it has become an urgent problem. Search engines alone are not enough to give users high-quality information that suits their own needs. To meet this challenge, personalized information recommendation has become an active research field in recent years.

This thesis addresses the massive-data bottleneck that personalized news recommendation systems encounter in practical use, and focuses on the study, analysis, and implementation of news text clustering. The main work is as follows. First, the thesis surveys the current state of research and application of personalized news recommendation systems, which leads to the problem of clustering news text at massive scale. Existing text clustering techniques and clustering schemes are studied in depth, with an analysis of their underlying ideas, application areas, strengths, and weaknesses. Second, to meet the scalability and efficiency requirements of a recommendation system in practical use, the thesis adopts a text grouping clustering algorithm based on locality-sensitive hashing (LSH) to cluster news text. To satisfy the dual requirement of relevance in both news topic and news content, it designs a hierarchical text grouping clustering scheme: on top of the text content features it adds a topic feature representation, applies a space transformation to the topic features, and applies a weighted transformation to the content and topic features so that they can be used with the LSH grouping clustering algorithm. This enables deeper mining of text features during clustering and effectively improves clustering performance while maintaining clustering accuracy. Finally, based on the proposed scheme, a news clustering system is designed and implemented; the thesis describes its processing flow, its database design, and the design and implementation of its functional modules. To verify the usability, accuracy, and efficiency of the system, experiments are run on a data set to obtain hierarchically structured news grouping clustering results. These results are compared with standard manually classified text results, and the evaluation of the clustering results is used to validate the improvement brought by the algorithm.

The thesis is organized as follows. Chapter 2 gives an overview of personalized news recommendation engines, analyzes the performance bottleneck the technology currently faces, and introduces text clustering as the solution. Chapter 3 introduces text clustering and analyzes the principles of several major text clustering algorithms. Chapter 4 proposes a news text clustering scheme based on the LSH grouping clustering algorithm, targeting the particular requirements of news recommendation systems. Chapter 5 presents the design and implementation of a news clustering system based on this scheme. Chapter 6 gives the system tests and experimental results and analyzes them.
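The abstract does not say which LSH family, feature dimensions, or weights the thesis uses, so the following is only a minimal sketch of the general idea under stated assumptions: content and topic feature vectors are combined by a weighted concatenation (the weight `alpha` is hypothetical), and documents are grouped with random-hyperplane LSH for cosine similarity, one common LSH variant. All function names are illustrative and do not come from the thesis.

```python
import random
from collections import defaultdict


def combine_features(content_vec, topic_vec, alpha=0.7):
    """Weighted concatenation of content and topic features.

    alpha is a hypothetical weight in [0, 1]; the thesis abstract does not
    give the actual weighting scheme.
    """
    return [alpha * x for x in content_vec] + [(1.0 - alpha) * x for x in topic_vec]


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) of size dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def group_by_signature(vectors, hyperplanes):
    """Put documents with identical signatures into the same candidate group."""
    buckets = defaultdict(list)
    for doc_id, vec in vectors.items():
        buckets[lsh_signature(vec, hyperplanes)].append(doc_id)
    return dict(buckets)


if __name__ == "__main__":
    # Toy data: made-up 3-dimensional content and 2-dimensional topic vectors.
    docs = {
        "news_a": combine_features([1.0, 0.0, 2.0], [0.9, 0.1]),
        "news_b": combine_features([2.0, 0.0, 4.0], [1.8, 0.2]),  # same direction as news_a
        "news_c": combine_features([0.0, 3.0, 0.0], [0.1, 0.9]),
    }
    planes = make_hyperplanes(dim=5, num_bits=8)
    for signature, ids in group_by_signature(docs, planes).items():
        print(signature, ids)
```

Vectors that point in similar directions tend to share a signature, so each bucket acts as a candidate group that can be refined afterwards. Avoiding all-pairs comparison in this way is what makes such a scheme attractive for large news collections.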
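[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1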