Research on the Question Recommendation Model of a Machine-Learning-Based Question-Answering Recommendation System

Published: 2018-09-12 06:40

[Abstract]: The question recommendation model described in this thesis belongs to a personalized recommendation system developed for an interactive Chinese question-answering platform. The platform holds a large number of unanswered questions. Based on a user's registration information and their login, browsing, and answering behavior on the platform, the personalized recommendation system recommends relevant questions to that user. This lowers the cost of finding open questions the user is able to answer, increases the number of answers, and improves knowledge sharing. The recommendation model is a content-based recommendation algorithm built with machine learning. Borrowing from precisely targeted advertising systems, it takes the click-through rate (CTR) of recommended questions as the optimization objective and combines Chinese word segmentation [76,77,78,79], keyword extraction, and named entity recognition (NER) [81,82,83,84] to build a CTR prediction model that matches users with questions. The CTR prediction model computes the conditional probability P(click=true | user=uid, question=qid); that is, the probability that a user clicks a new question serves as the measure of how well the user and the question match. A maximum entropy (Max Entropy) model is used to fit this conditional probability.
The original version of the question recommendation model has two shortcomings. First, it uses only a very small number of features, and this low feature dimensionality makes the model prone to underfitting. Second, a static recommendation model cannot adapt to changes in the data distribution. This thesis improves the original model in two respects:
1. Semantic features, combination features, and bias terms are introduced into the question recommendation model and, together with model selection and regularization, raise its accuracy. The improved model uses probabilistic latent semantic analysis (pLSA) to extract semantic features from question text; processing text at the semantic level gives better results than processing it at the lexical level. On the benchmark data set, accuracy rises from 88% for the original model to 95% for the improved model.
2. An offline training system for the question recommendation model is designed and implemented. The system automatically downloads the base data, extracts features, trains models, and performs model selection, so the model can be trained offline and updated periodically. Its purpose is to produce a new recommendation model at regular intervals. Experimental results show that the data distribution for the question recommendation model changes over time, which a static model cannot adapt to.
The improved question recommendation model and the offline training system are now in production, providing more accurate personalized question recommendations to users of the platform.
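As an illustration of the click-rate model in the abstract, the sketch below fits P(click=true | user, question) on a toy log. For a binary outcome, a maximum entropy model has the same form as logistic regression, so that is what the code trains. The tag-based features, the `features`, `predict`, and `train` helpers, the toy data, and the hyperparameters are all assumptions made for this example; they are not the thesis's actual feature set or training procedure.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def features(user_tags, question_tags):
    """Sparse binary features: a bias term, user-side tags, question-side
    tags, and user x question combination features (hypothetical names)."""
    feats = {"bias": 1.0}
    for u in user_tags:
        feats["u:" + u] = 1.0
    for q in question_tags:
        feats["q:" + q] = 1.0
    for u in user_tags:
        for q in question_tags:
            feats["u:" + u + "&q:" + q] = 1.0
    return feats


def predict(weights, feats):
    # P(click=true | user, question) under the current weights.
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return sigmoid(z)


def train(samples, epochs=200, lr=0.5, l2=0.001):
    """Stochastic gradient descent on L2-regularized log loss.
    samples: list of (feature dict, clicked) pairs with clicked in {0, 1}."""
    weights = {}
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for feats, clicked in samples:
            error = predict(weights, feats) - clicked
            for name, value in feats.items():
                w = weights.get(name, 0.0)
                # The bias term is left unregularized.
                penalty = 0.0 if name == "bias" else l2 * w
                weights[name] = w - lr * (error * value + penalty)
    return weights


# Toy click log (assumed data): users click questions that match their interest.
TOY_LOG = [
    (features(["python"], ["python"]), 1),
    (features(["python"], ["cooking"]), 0),
    (features(["cooking"], ["cooking"]), 1),
    (features(["cooking"], ["python"]), 0),
]

if __name__ == "__main__":
    w = train(TOY_LOG)
    print(predict(w, features(["python"], ["python"])))
    print(predict(w, features(["python"], ["cooking"])))
```

In this toy log a click depends on whether the user's interest and the question's topic agree, which no weighting of user-only and question-only features can express. The combination features make the pattern linearly separable, a small-scale version of the underfitting problem the abstract attributes to a model with too few features.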
[Degree-granting institution]: Sun Yat-sen University (中山大学)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP181;TP391.3

[Co-cited literature]