Research on Web Log Storage and Preprocessing Optimization Based on Hadoop

Published: 2018-01-02 02:36

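Keywords: Research on Web Log Storage and Preprocessing Optimization Based on Hadoop　Source: Hebei University of Engineering, 2016 master's thesis　Document type: degree thesis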

【Abstract】: With the development of the Internet and the mobile Internet, the volume of Web logs on servers has grown rapidly. Web logs record how users browse Web pages, so they are an important guide for building websites and delivering precisely targeted services. However, the data in raw Web log files is usually incomplete, redundant, or even erroneous. Analyzing it directly is very difficult and may produce wrong results, so preprocessing Web log data is necessary. In view of the storage constraints of traditional relational databases and the limitations of single-node data processing, this thesis uses the Hadoop distributed processing platform to store and preprocess Web log data. The main work is as follows:
(1) Web log data storage. Faced with the rapid growth of massive Web logs, traditional storage technology suffers from high construction cost, complex operation and maintenance, and limited scalability, whereas the cloud databases now in wide use offer dynamic scalability, high elasticity, high throughput, and low cost. This thesis therefore stores Web logs in HBase, the Hadoop database, to make full use of the cluster's distributed processing capability.
(2) HBase load-balancing optimization. The way data is stored in HBase largely determines the performance of the whole cluster and directly affects the efficiency of subsequent reads. When MapReduce reads Web log data from HBase, access "hot spots" may arise. To address this, the thesis proposes an improved load-balancing algorithm, the HBase load-balancing algorithm based on sub-table restriction. When assigning sub-tables (regions), it considers not only the load of each HRegionServer but also how the regions produced by splitting a table are distributed, so as to balance the cluster load as far as possible.
(3) Preprocessing Web logs with MapReduce. Web log preprocessing determines the quality of subsequent Web mining, and the computing power of a single node increasingly shows its limits as Web logs grow at scale, whereas MapReduce supports operation on large clusters. After analyzing the Web log preprocessing workflow, the thesis reads the data from HBase and carries out the preprocessing with the MapReduce computing model.
Comparative experiments verify that the optimized HBase load-balancing algorithm can effectively resolve unbalanced access load in a suitable cluster environment, and that MapReduce handles Web log preprocessing efficiently. Finally, the thesis optimizes the preprocessing algorithm and verifies the efficiency of the optimized version.
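The load-balancing idea described in the abstract (weigh each HRegionServer's overall load together with how one table's regions are spread) can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give, and it does not use any HBase API. The function name `assign_regions`, the greedy strategy, and the `max_per_table` cap are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_regions(regions, servers, max_per_table):
    """Greedy toy assignment of regions to region servers.

    regions: list of (table_name, region_id) pairs.
    servers: list of server names.
    max_per_table: hypothetical cap on how many regions of one table
        a single server should hold (an assumption, not from the thesis).
    Returns a dict mapping (table_name, region_id) -> server name.
    """
    total = {s: 0 for s in servers}                     # regions per server
    per_table = {s: defaultdict(int) for s in servers}  # per-table counts
    plan = {}
    for table, region_id in regions:
        # Prefer servers still under the per-table cap; if every server
        # has reached it, fall back to considering all servers.
        candidates = [s for s in servers
                      if per_table[s][table] < max_per_table] or list(servers)
        # Fewest regions of this table first, then lowest total load.
        target = min(candidates,
                     key=lambda s: (per_table[s][table], total[s]))
        plan[(table, region_id)] = target
        total[target] += 1
        per_table[target][table] += 1
    return plan


def table_counts(plan, table):
    """Count how many regions of `table` each server received."""
    counts = defaultdict(int)
    for (t, _), server in plan.items():
        if t == table:
            counts[server] += 1
    return dict(counts)


if __name__ == "__main__":
    regions = [("weblog", i) for i in range(4)] + [("other", i) for i in range(2)]
    plan = assign_regions(regions, ["rs1", "rs2"], max_per_table=2)
    print(table_counts(plan, "weblog"))
```

In this toy setting the regions of a heavily read table end up spread across servers instead of clustering on one, which is the kind of hot-spot avoidance the abstract describes for MapReduce reads over HBase.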
【Degree-granting institution】: Hebei University of Engineering
【Degree level】: Master's
【Year degree awarded】: 2016
【Classification number】: TP311.13
