Design and Implementation of a Cloud Computing-Based Log Mining System

Published: 2018-10-07 19:54

[Abstract]: As society becomes increasingly information-driven, the volume of information is growing explosively. Cloud computing has become an important means of meeting the resulting challenges of massive data storage and computation. The cloud computing-based log mining system described here analyzes and mines the massive user logs of a search engine, applies complex multi-dimensional mapping and cross-calculation to them, and transforms them into per-dimension statistics in a data warehouse, thereby establishing a platform for data mining. The system produces thirteen specific traffic indicators for the search engine website. By tracking changes in site traffic, these indicators provide a basis for analyzing website operations and support product, business, and decision-making work. Following software engineering methodology, the thesis first carries out a business and requirements analysis and identifies the four functional requirements of the log mining system. It then presents the overall system design and process framework, and divides the system into three modules for design and implementation: log preprocessing, log analysis and statistics jobs, and online analytical processing (OLAP). The design part explains in detail the data models, the XML configuration, the dimension and fact tables, and the dimension mapping and cross rules. The implementation part covers the log data loading process, the ETL process, the dimension parser, the algorithm for each indicator, and the data warehouse solution for multi-dimensional cross analysis. In particular, it gives a detailed implementation flow for the indicator algorithms built on Hadoop. By applying cloud computing technology, Hadoop's Map/Reduce programming framework, data mining, and data warehouse OLAP, the thesis presents a worked development example of a cloud computing-based log mining system.
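The abstract does not name the thirteen indicators or show how they are computed on Hadoop. As a rough illustration only, the sketch below mimics the Map/Reduce pattern in plain Python (no Hadoop dependency) to compute two commonly used traffic indicators, page views (PV) and unique visitors (UV), per (date, channel) dimension pair. The log format, the field names, the choice of PV and UV, and all function names are assumptions made for this example and are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical log format (an assumption, not from the thesis):
# "date<TAB>user_id<TAB>channel<TAB>query"
SAMPLE_LOG = [
    "2013-01-01\tu1\tweb\thadoop",
    "2013-01-01\tu2\tweb\tolap",
    "2013-01-01\tu1\tweb\tetl",
    "2013-01-01\tu3\tmobile\thadoop",
    "malformed line",
]


def map_phase(lines):
    """Map step: emit ((date, channel), user_id) for each well-formed line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # preprocessing: skip records that do not match the format
        date, user_id, channel, _query = fields
        yield (date, channel), user_id


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle/sort."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: PV is the record count, UV is the distinct user count."""
    return {
        key: {"pv": len(users), "uv": len(set(users))}
        for key, users in grouped.items()
    }


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of log lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    for key, metrics in sorted(run_job(SAMPLE_LOG).items()):
        print(key, metrics)
```

In a real Hadoop job the grouping step is performed by the framework between the mapper and the reducer, and the resulting per-dimension rows would be loaded into the warehouse's fact tables; here a dictionary stands in for both.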
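[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3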

[References]