Research and Implementation of a Distributed Web Log Analysis System Based on the Hadoop Platform
[Abstract]: With the progress of science and technology and the rapid development of the Internet, the Internet has become ever more closely linked with people's lives. Websites running on the Internet generate large volumes of log data every day, and users' access records are kept in these web logs. Analyzing log data is therefore an important means of understanding website operation, user access patterns, and other information, and mining valuable information from it helps enterprises provide users with better and more convenient services. At present, most log analysis systems run on a single machine and, faced with massive web log data, are adequate in neither performance nor storage capacity. To meet the needs of big data analysis, many data processing schemes have emerged, in particular the cloud computing technology represented by Hadoop, whose powerful distributed storage and computing capabilities provide a good platform for storing and analyzing massive web logs.
This thesis first introduces the development of distributed technology and the background of current web log mining. It then studies the core components of Hadoop: HDFS, MapReduce, and the Hive data warehouse. The data storage principles of the HDFS distributed file system, its data access modes, its fault-tolerance mechanism, and the programming model of the MapReduce parallel computing framework are examined in detail. A business data processing model suited to web log analysis is then established, and an efficient web log analysis system is designed on the Hadoop platform. The system comprises five modules: log collection, log storage, log preprocessing, key index statistics, and log mining. Log storage combines HDFS with MySQL, with HDFS holding both the original logs and the cleaned logs. Log preprocessing uses MapReduce to parallelize the standardized cleaning of noisy data. Index statistics uses HQL scripts on the Hive data warehouse to analyze the site's operation. Log mining applies an improved K-means algorithm on the MapReduce platform to cluster registered users, which improves the algorithm's efficiency on massive data. Finally, system testing shows that the Hadoop-based web log analysis system is greatly improved in collection, processing, storage, and mining, both reducing developers' workload and raising the system's efficiency.
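The log preprocessing step described above (MapReduce-parallelized cleaning of noisy raw logs) can be sketched as a Hadoop Streaming-style mapper. This is a minimal illustration, not the thesis's actual code: the Common Log Format regex, the static-resource and status-code filter rules, and all function names are assumptions.

```python
import re

# Common Log Format: host ident user [time] "request" status bytes
# (illustrative pattern; the thesis's actual log format may differ)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+)'
)

# Requests for static resources are treated as noise and dropped (assumed rule).
STATIC_SUFFIXES = ('.css', '.js', '.gif', '.jpg', '.png', '.ico')

def clean_line(line):
    """Parse one raw log line; return a tab-separated record, or None for noise."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None                      # malformed line: drop
    url = m.group('url')
    if url.lower().endswith(STATIC_SUFFIXES):
        return None                      # static resource: drop
    if not m.group('status').startswith('2'):
        return None                      # unsuccessful request: drop (assumed rule)
    size = m.group('bytes')
    size = '0' if size == '-' else size  # normalize missing byte counts
    return '\t'.join([m.group('host'), m.group('time'), url,
                      m.group('status'), size])

def run_mapper(lines):
    """Drive the mapper over an iterable of raw lines (stdin in a real Streaming job)."""
    return [r for r in (clean_line(l.rstrip('\n')) for l in lines) if r is not None]
```

In an actual Hadoop Streaming job the mapper would read from standard input and print each cleaned record to standard output; the cleaned output would then be loaded into HDFS for the Hive statistics stage.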
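The log-mining module's K-means clustering on MapReduce can likewise be sketched in miniature: the map phase assigns each point to its nearest centroid, and the reduce phase averages each cluster to produce new centroids. The 2-D points, function names, and single-iteration framing are illustrative assumptions; the thesis's "improved" variant is not reproduced here.

```python
import math

def kmeans_map(point, centroids):
    """Map phase: emit (index of nearest centroid, point)."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists)), point

def kmeans_reduce(assigned_points):
    """Reduce phase: new centroid = component-wise mean of the cluster's points."""
    n = len(assigned_points)
    dim = len(assigned_points[0])
    return tuple(sum(p[i] for p in assigned_points) / n for i in range(dim))

def kmeans_iteration(points, centroids):
    """One full MapReduce round: map every point, shuffle by key, reduce per cluster."""
    clusters = {}
    for p in points:
        k, _ = kmeans_map(p, centroids)
        clusters.setdefault(k, []).append(p)   # shuffle: group points by centroid key
    return [kmeans_reduce(pts) for _, pts in sorted(clusters.items())]
```

In the distributed setting each iteration is one MapReduce job, with the current centroids broadcast to every mapper and the loop repeated until the centroids converge.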
【Degree-granting institution】: Southwest Petroleum University
【Degree level】: Master's
【Year awarded】: 2017
【CLC number】: TP311.13