Design and Implementation of a Web Log Analysis System on a Eucalyptus-Based Hadoop Cluster

Published: 2018-06-18 08:44

Topic: cloud computing + Eucalyptus; Source: master's thesis, Beijing University of Posts and Telecommunications, 2016

【Abstract】: With the rapid growth of the Internet, the volume of Web logs keeps increasing, and these logs contain a great deal of information. Analyzing them can yield information of value to an enterprise. As log volumes grow, traditional single-machine analysis has hit a bottleneck: once the data exceeds a certain size, the computing power of a single node can no longer meet demand. This thesis designs and implements a Web log analysis system on a Hadoop cluster based on Eucalyptus. The system uses cloud computing and distributed technology to analyze and process large-scale Web logs. Test results show that it greatly improves computing capacity and running speed. First, a Eucalyptus private cloud platform is built. To combine the Eucalyptus platform's quick and convenient creation of virtual machines with the distributed processing strengths of a Hadoop cluster, the Hadoop cluster is deployed on the Eucalyptus cloud platform. Second, MapReduce programs are used to analyze the Web logs of an online education website. They produce site metrics such as visitor count, page views, IP count, bounce rate, average visit duration, traffic sources, and visited pages, and the results are presented to users in visual form. In addition, the thesis applies an improved parallel Apriori algorithm to mine association rules from the Web logs, obtaining the correlations among the site's pages. Website administrators and operators can use these metrics to understand the site better and, based on the results, adjust the site structure, carry out effective marketing strategies, and make personalized recommendations to users. Finally, log analysis performance is tested and compared between the distributed environment and a single-machine environment. The results show that, when processing large volumes of Web log data, performance in the distributed environment is far higher than in the single-machine environment. The improved parallel Apriori algorithm is also compared with single-machine Apriori. The results show that the improved parallel algorithm performs better in running time and in CPU and memory utilization.
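The abstract names the site metrics but not how they are computed. The following is a minimal sketch, in plain Python rather than the Hadoop API, of how a map/shuffle/reduce pass could derive page views, visitor count, IP count, and bounce rate. The record layout `(visitor_id, ip, url)`, the sample data, and all function names are assumptions made for illustration and do not come from the thesis. Each visitor is treated as a single visit, so a bounce here means a visitor with exactly one page view.

```python
from collections import defaultdict

# Hypothetical, simplified log records: (visitor_id, ip, url).
# A real access log would need parsing and session splitting first.
SAMPLE_LOG = [
    ("v1", "10.0.0.1", "/index"),
    ("v1", "10.0.0.1", "/course/1"),
    ("v2", "10.0.0.2", "/index"),
    ("v3", "10.0.0.2", "/course/2"),
]


def map_record(record):
    """Map phase: emit (visitor_id, (ip, url)) for one log record."""
    visitor_id, ip, url = record
    yield visitor_id, (ip, url)


def shuffle(pairs):
    """Group mapped pairs by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_visitor(visitor_id, values):
    """Reduce phase: summarize one visitor's page views and IP addresses."""
    return {
        "visitor": visitor_id,
        "views": len(values),
        "ips": {ip for ip, _ in values},
    }


def site_metrics(records):
    """Run map, shuffle, and reduce in-process, then aggregate site metrics."""
    mapped = (pair for record in records for pair in map_record(record))
    summaries = [reduce_visitor(k, v) for k, v in shuffle(mapped).items()]
    visitors = len(summaries)
    all_ips = set()
    for summary in summaries:
        all_ips |= summary["ips"]
    # Assumption: one visit per visitor, so a bounce is a single-view visitor.
    bounces = sum(1 for s in summaries if s["views"] == 1)
    return {
        "page_views": sum(s["views"] for s in summaries),
        "visitors": visitors,
        "ip_count": len(all_ips),
        "bounce_rate": bounces / visitors if visitors else 0.0,
    }
```

Calling `site_metrics(SAMPLE_LOG)` on the sample records gives 4 page views, 3 visitors, 2 distinct IPs, and a bounce rate of 2/3. On a cluster, the map and reduce functions would run as separate tasks over log splits instead of in one process.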
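The abstract does not describe how the Apriori algorithm was improved, so the following sketch shows only a plain count-distribution style of parallel Apriori, not the author's method. Each partition of sessions counts the candidate itemsets locally (the map step), and the partial counts are summed (the reduce step). Function names and the representation of a session as a set of page URLs are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_candidates(partition, candidates):
    """Map step: count candidate itemsets within one partition of sessions."""
    counts = Counter()
    for session in partition:
        for candidate in candidates:
            if candidate <= session:  # candidate is a subset of the session
                counts[candidate] += 1
    return counts


def merge_counts(partial_counts):
    """Reduce step: sum the per-partition counts into global support counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def next_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (Apriori pruning)."""
    items = sorted({item for itemset in frequent for item in itemset})
    return [
        frozenset(combo)
        for combo in combinations(items, k)
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1))
    ]


def apriori(partitions, min_support):
    """Return {itemset: support count} for itemsets meeting min_support.

    partitions is a list of lists of sessions; each session is a set of pages.
    Partitions are processed sequentially here; on a cluster each one would
    be handled by its own map task.
    """
    items = {item for part in partitions for session in part for item in session}
    candidates = [frozenset([item]) for item in items]
    result, k = {}, 1
    while candidates:
        totals = merge_counts(count_candidates(p, candidates) for p in partitions)
        frequent = {c: n for c, n in totals.items() if n >= min_support}
        result.update(frequent)
        k += 1
        candidates = next_candidates(set(frequent), k)
    return result
```

For example, with the sessions `{"/index", "/course/1"}`, `{"/index", "/course/1", "/pay"}`, `{"/index", "/pay"}`, and `{"/course/1"}` split across two partitions and `min_support=2`, the pairs `/index` with `/course/1` and `/index` with `/pay` are frequent, while `/course/1` with `/pay` is not. Association rules between pages would then be derived from these support counts.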
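【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year degree granted】: 2016
【Classification number】: TP311.13;TP393.09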
