A Study of User Behavior on "Science and Technology Paper Online" Based on Historical Context Mining

Published: 2018-03-30 22:08

Topic: context  Focus: web logs  Source: Wuhan University of Technology, 2013 master's thesis

[Abstract]: "China Science and Technology Paper Online" is hosted by the Science and Technology Development Center of the Ministry of Education. Guided by the aims of "expounding academic viewpoints, protecting intellectual property, exchanging and innovating ideas, and sharing papers quickly", it gives researchers a convenient, fast platform for academic exchange, on which new results can be promoted and innovative research ideas exchanged in a timely way. As an information-acquisition website, it delivers large amounts of information quickly and conveniently, but it also raises difficult questions: how to let users find the research information they need quickly and accurately, and how to interpret existing historical user data and use it to predict users' future behavior. Studying the behavior of "Science and Technology Paper Online" users can effectively address these questions. After analyzing the respective strengths and weaknesses of historical context information and web information, this thesis fuses historical context information with web logs. The fused data draws on a wider range of sources, gives a fairly complete picture of the environment in which a user visits a page, and reflects fairly accurately the user's mood, psychological state, and behavioral characteristics at that time. Mining these two kinds of data together yields a fairly accurate picture of users' access patterns and access characteristics. The thesis studies the algorithms used at each stage of historical context mining (data acquisition, fusion, and preprocessing) and makes some improvements and innovations. It then applies DICA, an improved clustering algorithm, to the session set produced by preprocessing, and derives recommendation sets from the clustering results in order to improve the site structure and provide recommendation services to users. The work falls into four parts:
(1) Data preprocessing: after a fairly thorough analysis of the data characteristics of historical context information and web logs, several kinds of historical context information are denoised and fused with the server-side web logs. A session partitioning algorithm then organizes the fused information into a session set. On that basis, a user access trajectory reconstruction algorithm simulates each user's access path at the time, and the session set is refined again accordingly. Finally, the terminal environment context within the historical context information is used to correct the browsing time recorded for each page.
(2) Page interest calculation: for the resulting session set, a multi-feature page interest calculation method assigns a weight to each page. Earlier weighting algorithms could not reflect the order in which a user browsed pages, so this thesis adds the ordinal position of a page within its session as a feature in the weight calculation. This effectively distinguishes cases in which different users visit the same set of pages in different orders.
(3) Cluster analysis of user behavior: after weights are assigned to the pages in the session set, the thesis proposes DICA, an improved k-means algorithm. DICA obtains the optimal number of clusters and the initial cluster centers automatically, which avoids two shortcomings of k-means: the number of clusters has to be set from experience, and the initial cluster centers are chosen at random.
(4) Generating recommendation sets: clustering the weighted session set with DICA yields a recommendation set based on group users and a recommendation set based on individual users. The two sets are fused and used to improve the site structure and to provide recommendation services to users.
This research was funded by the Ministry of Education project "Research on User Behavior of 'China Science and Technology Paper Online' Based on Context Awareness" (project number 20121140004).
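The abstract does not give the thesis's actual algorithms, so the following Python sketch only illustrates the general shape of two of the steps it describes: splitting a log into sessions, and weighting pages so that visit order matters. Everything specific here is an assumption made for illustration: the 30-minute inactivity timeout, the `(user, timestamp, url)` record layout, the function names `split_sessions` and `page_weights`, and the mixing parameter `alpha`. It is not the thesis's session partitioning algorithm or its multi-feature interest formula.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity timeout; the abstract does not state the rule used.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(records, gap=SESSION_GAP):
    """Group (user, timestamp, url) records into per-user sessions.

    A new session starts whenever two consecutive requests by the same
    user are more than `gap` apart.
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # chronological order within each user
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > gap:
                sessions.append((user, current))
                current = []
            current.append(hit)
        sessions.append((user, current))
    return sessions


def page_weights(hits, alpha=0.5):
    """Weight each page in one session by dwell time and visit position.

    Hypothetical formula: alpha * (share of session dwell time)
    + (1 - alpha) * (position / session length).
    """
    n = len(hits)
    times = [ts for ts, _ in hits]
    dwell = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    # The last page has no following request, so its dwell time is unknown;
    # fall back to the mean of the observed dwell times (0 if none).
    dwell.append(sum(dwell) / len(dwell) if dwell else 0.0)
    total = sum(dwell) or 1.0
    return [
        (url, alpha * d / total + (1 - alpha) * (i + 1) / n)
        for i, ((_, url), d) in enumerate(zip(hits, dwell))
    ]
```

Because the position term differs for each slot in the session, two users who visit the same pages in a different order get different weight vectors, which is the property the abstract attributes to adding the page's ordinal number as a feature.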
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13
