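Research on Mining User Interest Points Based on a Knowledge Base and Text Classification Algorithms
Published: 2018-01-21 07:57
Keywords: knowledge base; keyword classification; URL classification; user interest projection. Source: Tianjin Normal University, master's thesis, 2013. Document type: degree thesis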
[Abstract]: With the rapid growth of the Internet in recent years, people can retrieve the information they need online, and search engines have become an important retrieval tool. However, search results are not tailored to individual users, so different users receive the same information. This ignores users' interests and fails to meet their real need for personalization. Given the massive volume of information on the web, how to mine users' interest points and provide personalized services has become an important research topic.
User interest point mining extracts a user's interest points from the user's browsing history, and its results directly reflect the accuracy and effectiveness of personalized services. This thesis studies the mining of user interest points.
The thesis analyzes and compares existing user interest point mining algorithms in detail and, to address their limitations, proposes mining user interest points with a knowledge base combined with text classification algorithms. The work is carried out on English-language corpora. First, a Wikipedia-based knowledge base is built with Lucene. Next, the keywords and URLs entered by the user are classified. Finally, the user's interests are projected. For keyword classification, the thesis proposes a method that combines co-occurrence words with WordNet expansion. For URL classification, it proposes a block-based method for extracting the main text of a web page and a DFSD-based feature extraction method. For user interest projection, it proposes a context-based projection method that maps a user's candidate interest points to a single interest point, thereby identifying the user's real interest. Comparative experiments demonstrate the efficiency and accuracy of the algorithms.
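The classification and projection steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `KNOWLEDGE_BASE` dictionary stands in for the Lucene index over Wikipedia, cosine similarity over word counts stands in for whatever scoring the thesis uses, and the function names `candidate_categories` and `project_interest` are hypothetical. It only shows the general idea of resolving an ambiguous keyword to one interest point by using the rest of the session as context.

```python
from collections import Counter
import math
import re

# Hypothetical toy knowledge base: category name -> descriptive text.
# The thesis builds its knowledge base from Wikipedia with Lucene;
# this dictionary is only a stand-in for illustration.
KNOWLEDGE_BASE = {
    "Sports": "football soccer basketball league match player team goal score",
    "Technology": "computer software python programming code internet search engine",
    "Animals": "snake python reptile jaguar cat mammal zoo wildlife",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Build a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


KB_VECTORS = {name: vectorize(text) for name, text in KNOWLEDGE_BASE.items()}


def candidate_categories(query):
    """Return every category with non-zero similarity to the query, best first."""
    q = vectorize(query)
    scored = [(cosine(q, vec), name) for name, vec in KB_VECTORS.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


def project_interest(query, context_queries):
    """Map the candidate categories of a query to a single interest point.

    The other queries in the session serve as context: the candidate
    most similar to that context is chosen.
    """
    candidates = candidate_categories(query)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    context = vectorize(" ".join(context_queries))
    return max(candidates, key=lambda name: cosine(context, KB_VECTORS[name]))
```

For example, the keyword `python` alone matches both the Technology and Animals categories in this toy data. Calling `project_interest("python", ["snake zoo", "reptile wildlife"])` selects Animals, while a context of `["programming code"]` selects Technology.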
[Degree-granting institution]: Tianjin Normal University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.1; TP311.13
Article ID: 1450916
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1450916.html