Research on a Push Algorithm Based on Personalized Prediction

Published: 2018-10-08 21:56

[Abstract]: Efficiently and accurately selecting trustworthy key information that interests users from massive volumes of information and data is one of the central research problems in the information service industry. Pull services based on search engines and information push services are currently the two main channels for obtaining information. In rural China, where economic development lags and farmers' education levels are generally low, obtaining information through search engines is not realistic, so information push services are better suited to these areas. "Personalization" is the starting point of the push model: the K nearest neighbor samples are selected and a prediction model is built on them in order to push specific information to the target user. The key difficulties in K-nearest-neighbor selection are measuring the similarity between samples and determining the value of K. This study addresses and improves both aspects and reports the following results. Building the push model first requires selecting a neighbor set for the target user, made up of the K user samples with the highest similarity. Commonly used similarity measures include the Pearson correlation coefficient, cosine similarity, mean squared differences (MSD), and Euclidean distance, but none of these captures complex nonlinear relationships between two users, so the resulting neighbor set is not accurate enough. This thesis introduces the maximal mutual information coefficient (MIC) as the similarity measure between users. Compared with traditional mutual information, MIC partitions each variable into superclumps and uses stepwise optimization to find the optimal partition points of each variable, thereby maximizing the mutual information between the two variables; it is applicable to nonlinear functions of any form and even to superpositions of functions.
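As an illustration of the conventional similarity measures named above, here is a minimal Python sketch; it is not the thesis's implementation. It assumes each user is a dictionary mapping item id to a rating on a 1 to 5 scale, computes every measure over co-rated items only, and turns MSD into a similarity by normalizing differences to the rating range. All function names and the toy ratings are hypothetical. The ratings of the `quadratic` user are an exact function of the target's ratings, yet their Pearson correlation is zero, which is the kind of nonlinear relationship these measures fail to capture.

```python
import math

def co_rated(u, v):
    """Items rated by both users; u and v map item id -> rating."""
    return sorted(set(u) & set(v))

def pearson(u, v):
    """Pearson correlation over co-rated items (0.0 if undefined)."""
    items = co_rated(u, v)
    if len(items) < 2:
        return 0.0
    xs = [u[i] for i in items]
    ys = [v[i] for i in items]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def cosine(u, v):
    """Cosine similarity restricted to co-rated items (an assumed variant)."""
    items = co_rated(u, v)
    num = sum(u[i] * v[i] for i in items)
    den = math.sqrt(sum(u[i] ** 2 for i in items)) * math.sqrt(sum(v[i] ** 2 for i in items))
    return num / den if den else 0.0

def msd_similarity(u, v, r_min=1.0, r_max=5.0):
    """1 minus the mean squared difference of range-normalized ratings (assumed form)."""
    items = co_rated(u, v)
    if not items:
        return 0.0
    msd = sum(((u[i] - v[i]) / (r_max - r_min)) ** 2 for i in items) / len(items)
    return 1.0 - msd

def top_k_neighbors(target, others, k, sim=pearson):
    """Names of the k users most similar to target; ties broken by name."""
    scored = [(sim(target, ratings), name) for name, ratings in others.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored[:k]]

# Hypothetical toy data: five items rated on a 1-5 scale.
target = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
others = {
    "linear": {"a": 1, "b": 2, "c": 3, "d": 4, "e": 4},
    "quadratic": {"a": 5, "b": 2, "c": 1, "d": 2, "e": 5},  # (x - 3)**2 + 1
    "opposite": {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1},
}

for name, ratings in others.items():
    print(name, pearson(target, ratings), cosine(target, ratings),
          msd_similarity(target, ratings))
print(top_k_neighbors(target, others, k=2))
```

A dependence measure such as MIC would be passed through the same `sim` parameter of `top_k_neighbors`, so the neighbor-selection step itself does not change when the similarity measure is swapped.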
MIC can therefore capture the complex nonlinear relationships between two users, which makes the neighbor set more accurate and improves the prediction accuracy of the push model. The other key component of the push model is predicting the target user's scores for unrated items from the neighbor set (the item rating prediction model). The predicted score directly determines whether an item is pushed to the target user, so a wrong prediction can lead to the wrong information being pushed. Choosing suitable training samples is the key to building a high-accuracy item rating prediction model. The neighbor set is obtained by computing similarity over all rated items, but when predicting a specific item for a specific user, differences in time, region, culture, and so on mean that using all neighbor samples as training samples does not necessarily give the best prediction. Selecting the k best samples from the full neighbor set is a k-nearest-neighbor selection problem, and the choice of k is its core. This study introduces geostatistics to analyze the structure of the neighbor set of each item to be predicted, derives a common range a, and for each user selects from the full neighbor set the k training samples whose distance is less than a, which yields a personalized prediction for each user. With these two improvements, to neighbor selection and to training sample selection, an item rating prediction model was built with a support vector machine using the MovieLens rating dataset as example data, and the prediction accuracy of item ratings improved substantially.
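The range-based selection of training samples can be sketched as follows. This is a hedged illustration under stated assumptions, not the thesis's procedure: it uses the standard Matheron estimator for the empirical semivariogram, takes the range a as an already known input instead of fitting a variogram model, and treats "distance" as any non-negative dissimilarity between samples (the abstract does not say which one is used). The function names and toy values are hypothetical.

```python
from itertools import combinations

def empirical_semivariogram(points, values, dist, lag_width, n_lags):
    """Matheron estimator: gamma(h) = sum((z_i - z_j)**2) / (2 * N(h)) per lag bin.

    Returns (bin center, gamma) pairs for the non-empty bins.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i, j in combinations(range(len(points)), 2):
        b = int(dist(points[i], points[j]) // lag_width)
        if b < n_lags:
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    return [((b + 0.5) * lag_width, sums[b] / (2 * counts[b]))
            for b in range(n_lags) if counts[b]]

def select_training_samples(distances, a):
    """Keep neighbors whose distance to the target is below the range a.

    distances maps neighbor name -> distance; the result is ordered by distance,
    so the number of samples k can differ from one target to the next.
    """
    kept = [name for name, d in distances.items() if d < a]
    return sorted(kept, key=lambda name: (distances[name], name))

# Hypothetical 1-D example: sample positions and observed values.
points = [0.0, 1.0, 2.0, 3.0]
values = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(points, values, lambda p, q: abs(p - q),
                                lag_width=1.0, n_lags=3)
print(gamma)

# Hypothetical distances from one target user to its neighbors, with range a = 0.6.
distances = {"u1": 0.2, "u2": 0.9, "u3": 0.5, "u4": 0.5}
print(select_training_samples(distances, a=0.6))
```

In a full pipeline, the samples returned by `select_training_samples` would serve as the training set for the rating regressor, for example a support vector machine as named in the abstract.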
[Degree-granting institution]: Hunan Agricultural University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.3

[References]