User Preference Analysis Based on Text Classification and Topic Models

Published: 2018-06-16 20:14

Topics: user preference analysis + text classification; source: Qingdao University of Science and Technology, 2017 master's thesis

[Abstract]: User preference refers to the rational, inclination-driven choices users make after weighing goods or services. The main purpose of user preference analysis is to filter, from massive amounts of information, the information users are interested in, so as to provide them with more personalized services. User preference analysis is therefore the foundation for building personalized services. Existing methods, however, still have many problems. On the one hand, most of them analyze users' inherent attributes, which makes it hard to mine finer-grained preferences. On the other hand, when they do analyze fine-grained preferences, both their accuracy and their efficiency fall short.

User preferences can be obtained by mining user behavior: fine-grained classification and clustering of the content a user browses yields the user's fine-grained preferences. First, a tag is a finer-grained representation than a category, and one piece of content can carry multiple tags, so tagging content at different levels provides preference features at different levels for user preference analysis. Second, clustering according to users' active intent, starting from the user's perspective and following the user's latent cognition, groups similar content together and provides preference features at the level of user behavior.

Based on this analysis, the thesis proposes two algorithms for tagging text and one optimized hierarchical clustering algorithm on undirected graphs. First, it proposes a weighted supervised LDA algorithm (WLLDA). The algorithm uses the chi-square test to reduce the dimensionality of text features. It adopts a new weighted bag-of-words model that raises the weight of those words in the original bag of words that are meaningful for topic classification, which widens the differences between topics and improves classification accuracy. It uses a multi-model ensemble, sampling and training separately for topics of different frequencies, to resolve the mutual interference that an unbalanced corpus causes in a single model. It also proposes a new way to compute topic closeness that, on top of the original topic probability, jointly considers three factors (keyword hit frequency, keyword hit count, and tag support), thereby improving the accuracy of topic prediction.

Second, it proposes a word2vec-based tagging algorithm. The algorithm uses CRF to extract keywords from text, then uses word vectors produced by word2vec together with LR to cluster the keywords and build the tag set, which avoids the incomplete coverage of manually compiled tag libraries. Finally, it denoises the text to extract its backbone and tags the text by comparing the similarity between the word vectors of the backbone words and the word vectors of the tags.

Third, it proposes a parallelized, optimized hierarchical clustering algorithm on undirected graphs, which abstracts users' active search-intent behavior as an undirected graph. By splitting nodes that have many edges, it weakens the negative effect of the decay factor on those nodes and at the same time allows undirected-graph clustering to be computed in parallel, giving substantial gains in both accuracy and computational efficiency. With these three algorithms, the thesis converts a user's degree of preference for content into a preference for tags and ultimately characterizes the user's fine-grained preference features, achieving the goal of user preference analysis.
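The abstract states that the chi-square test is used to reduce the dimensionality of text features but gives no further detail. The following is a minimal sketch of chi-square term scoring on a 2x2 term/class contingency table, not the thesis's implementation. The names `chi_square`, `select_terms`, `DOCS`, and `LABELS` are hypothetical, and the toy corpus is invented for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table carries no evidence of dependence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_terms(docs, labels, target, k):
    """Return the k terms whose presence depends most strongly on `target`.

    Chi-square measures dependence in either direction, so terms that are
    typical of the other classes score as high as terms typical of `target`.
    """
    n_target = sum(1 for y in labels if y == target)
    n_other = len(labels) - n_target
    in_target, in_other = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, not raw term frequency.
        (in_target if y == target else in_other).update(set(tokens))
    scores = {}
    for term in set(in_target) | set(in_other):
        a, b = in_target[term], in_other[term]
        scores[term] = chi_square(a, b, n_target - a, n_other - b)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Invented toy corpus (assumption, not data from the thesis).
DOCS = [
    ["phone", "camera", "battery"],
    ["phone", "screen"],
    ["novel", "author"],
    ["novel", "plot", "author"],
]
LABELS = ["tech", "tech", "books", "books"]

if __name__ == "__main__":
    print(select_terms(DOCS, LABELS, "tech", 3))
```

Keeping only the top-scoring terms shrinks the vocabulary before a topic model is trained; how many terms to keep is a tuning choice that the abstract does not specify.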
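The final tagging step of the word2vec-based algorithm compares the word vectors of a text's backbone words with the word vectors of the tags. The abstract does not say how the similarities are aggregated, so the sketch below makes two loudly stated assumptions: cosine similarity is the measure, and a tag is assigned when its best match among the backbone words reaches a threshold. The function names, the threshold value, and the hand-made three-dimensional vectors are hypothetical stand-ins for real word2vec output.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def tag_text(backbone_words, word_vectors, tag_vectors, threshold=0.8):
    """Return (tag, score) pairs for tags close to at least one backbone word.

    Assumption: a tag's score is its maximum cosine similarity over the
    backbone words; the thesis may aggregate differently.
    """
    scores = {}
    for tag, tag_vec in tag_vectors.items():
        sims = [
            cosine(word_vectors[w], tag_vec)
            for w in backbone_words
            if w in word_vectors  # skip out-of-vocabulary words
        ]
        if sims:
            scores[tag] = max(sims)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(tag, score) for tag, score in ranked if score >= threshold]


# Hand-made toy vectors (assumption); real ones would come from word2vec.
WORD_VECTORS = {
    "camera": [0.9, 0.1, 0.0],
    "lens": [0.8, 0.2, 0.1],
    "recipe": [0.0, 0.1, 0.9],
}
TAG_VECTORS = {
    "photography": [1.0, 0.1, 0.0],
    "cooking": [0.0, 0.0, 1.0],
    "finance": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for tag, score in tag_text(["camera", "lens"], WORD_VECTORS, TAG_VECTORS):
        print(tag, round(score, 3))
```

Because one text can match several tags above the threshold, this rule naturally yields the multi-tag annotation the abstract describes.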
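[Degree-granting institution]: Qingdao University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1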

[References]