Research on a Personalized Recommendation System Model Based on Web Log Mining and Association Rules

Published: 2018-08-17 09:30

[Abstract]: With the rapid development of science and technology, the wealth of information available on the Internet has helped upgrade many sectors of industry, but it has also brought problems: the explosive growth of information leads to "information overload". At the same time, to offer Internet users more comprehensive information resources, website operators and administrators keep adding content to their Web sites, which makes site topology increasingly complex. Because newly added resources may not match users' real needs, users browsing a site can easily experience "resource disorientation". How to discover the information people are interested in within massive amounts of data is therefore the problem we face, and it has led to the application of data mining to Web site analysis, known as Web mining.

Web mining is an integrated technology that draws on Web technology, data mining, informatics, computational linguistics and other fields. It is useful in many ways, for example mining the structure of search engines, identifying authoritative pages, classifying Web documents, Web usage mining, intelligent querying, and building Metaweb data warehouses. Web usage mining discovers users' behavioral characteristics and navigation patterns from server logs. This thesis systematically describes data mining, Web mining, and the full Web usage mining process, and focuses on three topics: Web log preprocessing, an association rule mining model, and a sliding-window recommendation model.

First, Web log preprocessing comprises data cleaning, user identification, session identification, path completion, and transaction identification. Preprocessing removes a large amount of irrelevant data from the user access records, structures those records, and stores them in a relational database as transactions or sessions.

Next, the preprocessed data are mined with weighted association rules. The classical association rule mining algorithm Apriori can uncover relationships among visited Web pages and plays an important role in discovering users' preferred navigation patterns. Applied to Web log mining, however, it has limitations. Apriori implicitly assumes that all pages are equally important and ignores the differences between them, so the mining results may miss some pages that users are interested in. To address this shortcoming, the thesis introduces the concept of "page weight", which reflects users' real preference for a page. The page weight is defined by jointly considering two factors, the time users spend browsing a page and how often they visit it, and on this basis the W-Apriori algorithm is proposed. The algorithm represents the transaction database as an extended Boolean matrix, which helps compress the transaction database. Introducing weights also helps distinguish between pages and effectively addresses the problem of important pages being missed during mining.

Finally, the mined rules are assembled into a rule base and combined with sliding-window technology to design and implement a Web log recommendation model based on association rule mining. The model can effectively alleviate "information overload" and "resource disorientation", and it can recommend pages of interest to the relevant Web users, making the recommendations personalized.
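
To make the weighted mining and sliding-window recommendation steps more concrete, here is a minimal Python sketch. It is not the thesis's W-Apriori algorithm: the abstract does not give the page-weight formula, the weighted-support definition, or the ranking rule, so everything below is an assumption made for illustration. That includes the `alpha` blend of normalized browsing time and visit frequency, the "mean weight times plain support" definition, the thresholds, the sample sessions, and all function names. The sketch also only mines rules between pairs of pages and leaves out the extended Boolean matrix representation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical preprocessed sessions: lists of (page, seconds_viewed) pairs.
SESSIONS = [
    [("A", 30), ("B", 10), ("C", 60)],
    [("A", 20), ("C", 40)],
    [("B", 5), ("C", 50), ("D", 90)],
    [("A", 10), ("B", 20)],
]

def page_weights(sessions, alpha=0.5):
    """Assumed weight: alpha * normalized total time + (1 - alpha) * normalized frequency."""
    total_time = defaultdict(float)
    visits = defaultdict(int)
    for session in sessions:
        for page, seconds in session:
            total_time[page] += seconds
            visits[page] += 1
    max_time = max(total_time.values())
    max_visits = max(visits.values())
    return {
        page: alpha * total_time[page] / max_time
        + (1 - alpha) * visits[page] / max_visits
        for page in total_time
    }

def weighted_support(itemset, transactions, weights):
    """Assumed definition: mean page weight of the itemset times its plain support."""
    count = sum(1 for t in transactions if itemset <= t)
    mean_weight = sum(weights[p] for p in itemset) / len(itemset)
    return mean_weight * count / len(transactions)

def mine_pair_rules(transactions, weights, min_wsup=0.3, min_conf=0.6):
    """Return (antecedent, consequent, confidence) rules between single pages."""
    rules = []
    for x, y in combinations(sorted(weights), 2):
        pair = frozenset((x, y))
        if weighted_support(pair, transactions, weights) < min_wsup:
            continue  # pair is not frequent under the weighted measure
        both = sum(1 for t in transactions if pair <= t)
        for ante, cons in ((x, y), (y, x)):
            conf = both / sum(1 for t in transactions if ante in t)
            if conf >= min_conf:
                rules.append((ante, cons, conf))
    return rules

def recommend(active_session, rules, weights, window=2, top_n=3):
    """Match rules against the last `window` pages; rank unseen consequents."""
    recent = set(active_session[-window:])
    visited = set(active_session)
    scores = {}
    for ante, cons, conf in rules:
        if ante in recent and cons not in visited:
            # Assumed ranking: rule confidence scaled by the consequent's weight.
            scores[cons] = max(scores.get(cons, 0.0), conf * weights[cons])
    ranked = sorted(scores, key=lambda page: (-scores[page], page))
    return ranked[:top_n]

transactions = [frozenset(page for page, _ in s) for s in SESSIONS]
weights = page_weights(SESSIONS)
rules = mine_pair_rules(transactions, weights)
print(recommend(["C", "D"], rules, weights, window=2))
```

With this toy data, widening the window from one page to two changes what can be recommended for the active session `["C", "D"]`, because page `D` alone matches no rule while page `C` does. In a real system the window size and both thresholds would need to be tuned on actual log data.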
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.3

[References]