Research and Implementation of a Distributed Web Log Analysis System Based on the Hadoop Platform

Published: 2018-11-04 14:52

[Abstract]: With advances in technology and the rapid growth of the Internet, the Internet has become ever more closely tied to people's daily lives. Websites generate large volumes of log data every day, and users' access records are stored in web logs. Analyzing this log data has become an important way to understand how a website is operating and how users access it, and mining the valuable information it contains helps enterprises provide better and more convenient services. Most current log analysis systems still run on a single machine, and neither their performance nor their storage capacity can cope with massive web log data. Many data processing solutions have emerged to meet the demands of big data analysis; in particular, cloud computing technology represented by Hadoop, with its powerful distributed storage and computing capabilities, offers a good platform for storing and analyzing massive web logs. This thesis first reviews the development of distributed technology and describes the background of current web log mining. It then studies the core Hadoop components HDFS and MapReduce, together with the Hive data warehouse, examining in depth the data storage principles, data access patterns, and fault-tolerance mechanisms of the HDFS distributed file system, as well as the programming model of the MapReduce parallel computing framework. Next, a suitable business data processing model is established for the web log analysis system, and an efficient web log analysis system is designed on the Hadoop platform. The system consists of five main modules: log storage, log collection, log preprocessing, key metric statistics, and log mining. Log storage combines HDFS with MySQL, with HDFS holding both the raw logs and the cleaned logs. Log preprocessing uses parallel MapReduce jobs to clean and standardize noisy data. Metric statistics use HQL scripts on the Hive data warehouse to analyze website operations. Log mining applies a K-means algorithm improved for the MapReduce platform to cluster registered users, which increases the algorithm's efficiency on massive data. Finally, system testing shows that the Hadoop-based web log analysis system improves substantially on traditional single-machine processing in collection, processing, storage, and mining, reducing developer workload while also raising system efficiency.
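As a rough illustration of how K-means clustering maps onto the MapReduce model mentioned in the abstract, the sketch below runs one iteration of standard K-means in plain Python, with the map, shuffle, and reduce phases made explicit. It is not the improved algorithm from the thesis, whose details the abstract does not give; the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`) and the representation of users as numeric tuples are assumptions made for this example only.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One standard K-means iteration written as map, shuffle, reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid with no assigned points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
```

On a real cluster, the map step would run in parallel over HDFS blocks and the iteration would be repeated as successive jobs until the centroids stop moving; here a dictionary plays the role of the shuffle so the data flow can be followed on one machine.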
[Degree-granting institution]: Southwest Petroleum University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
