Research on Several Key Problems and Techniques in Personalized Information Recommendation

Published: 2018-06-19 19:18

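Topics: personalized information recommendation + rating prediction; source: doctoral dissertation, National University of Defense Technology, 2014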

[Abstract]: The rapid development of Internet technology and the spread of networked information have caused the volume of information online to grow explosively, and people now struggle to find what they need under information overload. How to help Internet users obtain the information they want more effectively has become a research focus at the intersection of information science, computer science, and network science. Thanks to the sustained efforts of many researchers, several reasonably efficient ways of obtaining information of interest now exist, chief among them information retrieval and information filtering. The former, typified by search engines, obtains the user's description of the target information through interaction and searches the network using the descriptive keywords. The latter, with information recommendation as its main method, collects users' behavioral data and other attribute information, analyzes their latent interests, and filters out information they may find interesting. Search requires users to supply keywords that describe their needs as precisely as possible, and a limited set of keywords cannot distinguish users with different habits, so everyone receives the same results. Recommendation instead infers a user's preferences and tendencies from information about the user and the interests implied by past behavior, and does not depend on users describing their own needs, so users can obtain more accurate information with less effort. Recommendation therefore serves users well when they have no explicit need in mind. Recommendation technology has been developing for nearly twenty years and has been applied with considerable success in many fields. On the theoretical side it has attracted a large number of researchers: interest in classic methods such as collaborative filtering remains strong, and new methods, such as those based on bipartite networks, continue to be proposed, further enriching the field.

As research deepens and application environments keep changing, recommendation technology faces a number of problems and challenges, the most important being data sparsity and large-scale data processing. Data sparsity refers to the situation in collaborative-filtering-based recommendation where the numbers of users and items are large but users have rated relatively few items, so that the user-item rating matrix is very sparse, which harms the accuracy of the computation. Large-scale data processing refers to the growing real-time pressure on recommendation algorithms as the volume of data handled in practice keeps increasing; this calls for more efficient methods, or other ways of improving execution efficiency, to raise the capacity and speed with which recommendation algorithms process data. In response to these challenges, this dissertation studies the following problems.

First, data sparsity in rating prediction based on collaborative filtering. Rating prediction is a main research topic in personalized information recommendation: it predicts ratings for unrated items by analyzing a user's past ratings. Data sparsity affects collaborative filtering mainly in two stages, user similarity computation and rating prediction generation. Sparse data leaves users with less data in common, which lowers the reliability of the computed similarities; and because of sparsity the completeness of the neighbors' ratings cannot be guaranteed, so predictions derived from an incomplete reference rating set cannot be guaranteed to be highly accurate. The dissertation therefore proposes selecting reference users (items) with an absolute similarity measure and improving the completeness of the reference rating set with a cross-dimension imputation method. Experimental results confirm that the proposed algorithm reduces the impact of data sparsity and improves recommendation accuracy.

Second, data sparsity in top-N recommendation based on bipartite networks. Top-N recommendation is another basic problem in personalized information recommendation, whose goal is to provide each user with a recommendation list of N items. Bipartite-network recommendation is a relatively novel approach that adapts better to sparse data and can achieve higher recommendation accuracy. When user interest is divided according to ratings, considering only the items a user likes results in low data utilization, and the items a user dislikes are not exploited enough. Moreover, the differences in interest reflected by ratings should appear not only in whether interest exists but also in the strength of interest and in the process of interest resource transfer. The dissertation proposes a new bipartite-network method that builds a negative-interest-aware user interest model by analyzing the information revealed by the items a user dislikes, and uses rating-sensitive methods for initializing and transferring user interest resources to reflect differing degrees of interest. Subsequent experiments show that the new method clearly improves recommendation performance.

Third, a rating prediction algorithm based on bipartite networks. For data with an unbalanced node degree distribution, the dissertation proposes an algorithm on bipartite networks, based on unbiased temperature-difference conduction and biased temperature constancy, to handle rating prediction. Because it requires neither similarity computation nor the selection of a fixed number of users (items) as neighbors, the bipartite-network approach can better mitigate the effect of sparse data. The proposed algorithm is based on the heat conduction process: it uses the temperature difference between users as the quantity to be conducted and compared, and sets the temperature difference a node receives to the mean of the temperature differences conducted from all of its connected nodes, thereby balancing the influence of all nodes. In addition, a temperature-constancy process is used to compute the predicted temperature of each item node, which yields the user's predicted rating for the item. The experiments show that on particular types of datasets the proposed algorithm outperforms collaborative-filtering-based methods, and that it is computationally more efficient than the classic heat conduction method.

Fourth, large-scale data processing for rating prediction and top-N recommendation algorithms based on MapReduce. Personalized information recommendation must handle ever larger volumes of data in practice, which places higher demands on execution efficiency. Some studies streamline the computation itself, for example through matrix dimensionality reduction, but such approaches are limited by the algorithm: they cannot guarantee that the streamlining will be sufficient, nor can an algorithm be streamlined indefinitely to improve its scalability. The dissertation examines the recommendation algorithms it proposes and designs and implements parallel versions of the bipartite-network top-N recommendation and rating prediction algorithms, using MapReduce's parallel computing capability to distribute the algorithm's total computation across multiple compute nodes running concurrently, thereby improving execution efficiency and reducing the time needed to process large-scale data. The advantage of this approach is that, as the data volume grows and provided the algorithm remains applicable, scalability can keep increasing as long as enough compute nodes are supplied to share the load.
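The resource-transfer idea behind bipartite-network top-N recommendation can be illustrated with a small sketch. The code below is not the dissertation's method: it shows only the standard two-step mass-diffusion baseline on a user-item bipartite network, without the negative-interest model or rating-sensitive weighting described in the abstract. The function name `mass_diffusion_top_n`, the `likes` input format, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def mass_diffusion_top_n(likes, target_user, n):
    """Rank unseen items for target_user by two-step resource diffusion.

    likes: dict mapping user -> set of liked items (hypothetical input format).
    Returns up to n (item, score) pairs, highest score first.
    """
    # Build the item -> users side of the bipartite network.
    users_of = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            users_of[item].add(user)

    # Initialization: each item the target user liked holds one unit of resource.
    item_resource = {item: 1.0 for item in likes[target_user]}

    # Step 1: every item splits its resource equally among its linked users.
    user_resource = defaultdict(float)
    for item, resource in item_resource.items():
        share = resource / len(users_of[item])
        for user in users_of[item]:
            user_resource[user] += share

    # Step 2: every user splits the received resource equally among their items.
    scores = defaultdict(float)
    for user, resource in user_resource.items():
        share = resource / len(likes[user])
        for item in likes[user]:
            scores[item] += share

    # Recommend only items the target user has not already collected.
    candidates = [
        (item, score)
        for item, score in scores.items()
        if item not in likes[target_user]
    ]
    # Sort by descending score; break ties by item name for determinism.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:n]


# Toy data (hypothetical): four users and four items.
likes = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "d"},
    "u4": {"c", "d"},
}
print(mass_diffusion_top_n(likes, "u1", 2))
```

On this toy data, item `c` ranks above item `d` for user `u1`, because `u2`, who shares both of `u1`'s items, passes on more resource than `u3`, who shares only one. Dividing by node degree at each step is what keeps the total resource constant during the transfer.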
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year conferred]: 2014
[Classification number]: TP391.3

[References]