Research on Personalized Search Algorithms Based on a User Interest Model

Published: 2019-04-27 10:33

[Abstract]: As the volume of information on the Internet has grown rapidly, search engines were developed to help people find information relevant to them, a major milestone in the development of resource retrieval. As user expectations have risen, however, the shortcomings of traditional search engines, such as low retrieval precision and many duplicate pages, have become increasingly apparent, and these engines no longer meet users' needs. Personalization and intelligence have therefore become the trend in search engine development. This thesis studies search engine personalization in some depth. The main contributions are as follows. First, building on a study of existing user interest models, a new algorithm for constructing a user interest model is proposed. Singular value decomposition (SVD) and k-means clustering are applied repeatedly at different granularities to cluster, at several levels, both the documents in the user's browsing history and the words they contain, producing two weighted interest trees: a document-class tree and a word-class tree. The weight of each node represents the user's degree of interest in that class of documents or words. Experimental results show that the proposed user interest model considerably improves the accuracy of page interest classification. Second, an improvement is proposed to address the shortcomings of the vector space model. SVD is used to reduce its dimensionality, and the resulting document-by-word-class matrix effectively addresses the vector space model's high dimensionality and sparsity as well as the problems of synonymy and polysemy. Experimental results show that the improved vector space model classifies pages considerably more accurately than the traditional vector space model. Finally, a new ranking algorithm is proposed to address the shortcomings of existing search engine ranking algorithms. Building on the proposed user interest model, a naive Bayes classifier performs document classification and word classification on the documents retrieved by a traditional search engine; each document is scored according to the classification results, and the documents are then sorted in descending order of score. Experimental results show that, under the same conditions, the proposed personalized ranking algorithm achieves higher precision than a personalized search algorithm based on a probabilistic model and better meets users' personalized needs.
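The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the rule used here (a document's score is the sum of its class posteriors weighted by the user's interest in each class) is an assumption, as are all function names and the toy data. The sketch uses a flat set of weighted document classes in place of the two interest trees and omits word classification entirely.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Fit a multinomial naive Bayes model with Laplace smoothing.

    labeled_docs: list of (tokens, class_label) pairs, standing in for
    pages from the user's browsing history that are already clustered.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total_docs = sum(class_counts.values())
    model = {}
    for label, n_docs in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / total_docs),
            "log_like": {
                w: math.log((word_counts[label][w] + 1) / denom) for w in vocab
            },
        }
    return model


def class_posteriors(model, tokens):
    """Return P(class | tokens); words outside the vocabulary are ignored."""
    log_scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in tokens:
            if w in params["log_like"]:
                score += params["log_like"][w]
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def rerank(model, interest_weights, results):
    """Score each retrieved document and sort by descending score.

    Assumed scoring rule: sum over classes of P(class | doc) * weight(class).
    """
    scored = []
    for doc_id, tokens in results:
        post = class_posteriors(model, tokens)
        score = sum(p * interest_weights.get(c, 0.0) for c, p in post.items())
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: browsing history, interest weights, search results.
history = [
    (["python", "code", "compiler"], "programming"),
    (["code", "bug", "debug"], "programming"),
    (["snake", "reptile", "zoo"], "animals"),
    (["zoo", "reptile", "habitat"], "animals"),
]
interest_weights = {"programming": 0.9, "animals": 0.1}
results = [
    ("d1", ["python", "snake", "habitat"]),
    ("d2", ["python", "code", "debug"]),
]

ranked = rerank(train_naive_bayes(history), interest_weights, results)
print(ranked)
```

With these toy weights, the document whose vocabulary matches the heavily weighted "programming" class is ranked ahead of the one about snakes, even though both contain the query word "python".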
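[Degree-granting institution]: Taiyuan University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3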

[References]