Research on Mining User Interest Points Based on a Knowledge Base and Text Classification Algorithms

Published: 2018-01-21 07:57

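Keywords: knowledge base; keyword classification; URL classification; user interest projection　Source: Tianjin Normal University, master's thesis, 2013　Document type: degree thesis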

[Abstract]: With the rapid growth of the Internet, people now retrieve the information they need online, and search engines have become a primary retrieval tool. Because search results are not tailored to individual users, however, different users receive the same information for the same query; their interests are ignored and their real need for personalization goes unmet. Given the huge volume of information on the web, mining users' interest points in order to provide personalized services has become an important research topic. User interest point mining extracts a user's interest points from that user's browsing history, and its results directly determine the accuracy and effectiveness of personalized services. This thesis studies that problem. It analyzes and compares existing user interest point mining algorithms in detail and, to address their limitations, proposes mining user interest points with a knowledge base combined with text classification algorithms. The work is carried out on English-language corpora. First, a knowledge base built on Wikipedia is constructed with Lucene. Next, the keywords and the URLs entered by the user are classified. Finally, user interest projection is performed. For keyword classification, the thesis proposes a method that combines co-occurrence words with WordNet-based expansion. For URL classification, it proposes a block-based method for extracting the main text of a web page and a DFSD-based feature extraction method. For user interest projection, it proposes a context-based projection method that maps a user's candidate interest points to a single interest point, thereby identifying the user's real interest. Comparative experiments are reported to demonstrate the efficiency and accuracy of the algorithms.
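The abstract does not spell out how the context-based projection works, so the following is only an illustrative sketch of the general idea of mapping several candidate interest points to one. It assumes that each candidate category carries a classifier score and that the user's recent actions have already been labeled with categories; the function name `project_interest`, the `context_weight` parameter, and the scoring formula are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def project_interest(candidates, context_history, context_weight=1.0):
    """Pick one interest point from scored candidates using recent context.

    candidates: dict mapping category name -> classifier score (hypothetical).
    context_history: categories assigned to the user's recent actions.
    context_weight: how strongly context support boosts a candidate (assumed).
    """
    if not candidates:
        return None

    counts = Counter(context_history)
    total = sum(counts.values())

    def combined(category):
        # Fraction of recent actions that fall in this category.
        support = counts[category] / total if total else 0.0
        # Assumed scoring rule: boost the classifier score by context support.
        return candidates[category] * (1.0 + context_weight * support)

    # Iterate in sorted order so ties are broken deterministically.
    return max(sorted(candidates), key=combined)


# An ambiguous keyword such as "apple" might yield two candidates.
candidates = {"Fruit": 0.5, "Technology": 0.45}
history = ["Technology", "Technology", "Sports"]
print(project_interest(candidates, history))
```

With no browsing context the sketch falls back to the highest classifier score; with context, a slightly lower-scored candidate can win if the user's recent activity supports it. Multiplicative boosting is one simple design choice among many, and the thesis may combine the signals differently.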
[Degree-granting institution]: Tianjin Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1;TP311.13

[References]