
Design and Implementation of a Eucalyptus-Based Hadoop Cluster Web Log Analysis System

Published: 2018-06-18 08:44

  Topics: cloud computing + Eucalyptus; Source: Beijing University of Posts and Telecommunications, 2016 master's thesis


[Abstract]: With the rapid growth of the Internet, the volume of Web logs keeps increasing, and these logs carry a great deal of information. Analyzing them can yield information of value to an enterprise. As log volumes grow, traditional single-machine analysis has hit a bottleneck: once the data exceeds a certain size, the computing power of a single node can no longer meet demand. This thesis designs and implements a Web log analysis system based on a Hadoop cluster running on Eucalyptus. The system uses cloud computing and distributed technology to analyze and process large-scale Web logs, and test results show that it substantially improves computing capacity and running speed. First, a Eucalyptus private cloud platform is built. Combining the ease and speed of creating virtual machines on Eucalyptus with the distributed processing strengths of a Hadoop cluster, the Hadoop cluster is deployed on the Eucalyptus cloud platform. Second, MapReduce programs analyze the Web logs of an online education website, producing site metrics such as visitor count, page views, IP count, bounce rate, average visit duration, traffic sources, and visited pages; the results are presented to users in visual form. In addition, the thesis applies an improved parallel Apriori algorithm to mine association rules from the Web logs, yielding the correlations among the site's pages. Site administrators and operations staff can use these metrics to understand the site better and, based on the results, adjust the site structure, carry out effective marketing strategies, and make personalized recommendations to users. Finally, log analysis performance is tested and compared between the distributed and single-machine environments; the results show that the distributed environment performs far better than the single-machine environment when processing large volumes of Web log data. The improved parallel Apriori algorithm is also compared against single-machine Apriori; the results show that it performs better in running time and in CPU and memory utilization.
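The MapReduce-style metric computation mentioned in the abstract can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the thesis's implementation: the three-field record format (`ip session_id url`), the sample records, and the function names `map_record`, `shuffle`, `reduce_counts`, and `analyze` are all hypothetical, and a session with exactly one hit is treated as a bounce.

```python
from collections import defaultdict

# Hypothetical, simplified log records: "ip session_id url" per line.
# Real Web server logs carry more fields and need proper parsing.
SAMPLE_LOG = [
    "10.0.0.1 s1 /index",
    "10.0.0.1 s1 /course/1",
    "10.0.0.2 s2 /index",
    "10.0.0.3 s3 /course/1",
    "10.0.0.3 s3 /course/2",
    "10.0.0.3 s4 /index",
]


def map_record(line):
    """Map phase: emit one (key, 1) pair per metric for a log line."""
    ip, session, url = line.split()
    yield ("pv", url), 1           # hit on a page
    yield ("ip", ip), 1            # hit from an IP address
    yield ("session", session), 1  # hit within a session


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def analyze(lines):
    """Derive page views, IP count, visits, and bounce rate from log lines."""
    pairs = (pair for line in lines for pair in map_record(line))
    counts = reduce_counts(shuffle(pairs))

    page_views = {k[1]: v for k, v in counts.items() if k[0] == "pv"}
    sessions = {k[1]: v for k, v in counts.items() if k[0] == "session"}
    ip_count = sum(1 for k in counts if k[0] == "ip")

    # Assumption: a bounce is a session that contains a single hit.
    bounces = sum(1 for hits in sessions.values() if hits == 1)
    return {
        "page_views": page_views,
        "total_page_views": sum(page_views.values()),
        "ip_count": ip_count,
        "visits": len(sessions),
        "bounce_rate": bounces / len(sessions) if sessions else 0.0,
    }


if __name__ == "__main__":
    print(analyze(SAMPLE_LOG))
```

On a real cluster the map and reduce functions would run as distributed tasks over log files in HDFS; here the shuffle step is simulated in memory so the data flow is easy to follow.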
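[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13;TP393.09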


Article ID: 2034877


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2034877.html

