Research on Real-Time Processing of Log Data Based on Storm and Hadoop
Topic: real-time log data processing + Hadoop; source: master's thesis, Southwest University, 2017.
【Abstract】: Log data records rich information about system and network user behavior and has high practical value in many fields, such as network management and user behavior analysis. With the advent of the big data era, the volume of log data generated per unit time grows geometrically, and the diversity, heterogeneity, and dynamic variability of log data pose challenges to its collection, storage, and in-depth analysis. Traditional log processing relies mainly on single-node servers, which lack scalability and offer very limited CPU, I/O, and storage capacity. Meanwhile, practical applications place ever stricter response-time demands on log analysis: real-time processing, together with high-throughput parallel computation over large data volumes, has become a basic requirement of log data processing. In real-time scenarios, stream computing can process log streams as they arrive and extract knowledge from the modest amount of data that accumulates within a given time window, but the limited data volume restricts both the applicable algorithms and the reliability of the results. The knowledge that real-time computation extracts and depends on therefore needs to be combined with the results that offline batch processing obtains from large-scale historical data.

To address the collection, storage, and analysis problems raised by rapidly growing log data, and the problem of extracting and integrating knowledge from offline data and real-time streams, this thesis studies the theory and practice of big data technology and builds a real-time log processing platform on the Hadoop distributed infrastructure, using Storm On YARN to integrate the MapReduce and Storm computing frameworks at the resource-scheduling level. Flume and HBase provide distributed log collection and storage; the high-throughput MapReduce framework extracts global knowledge from large-scale offline data; Storm extracts burst knowledge from the small-scale data in the Kafka buffer and, drawing on both kinds of knowledge, performs continuous real-time computation over the stream, improving accuracy while preserving real-time response. The main contents and results are as follows:

(1) Real-time log processing platform. A three-layer platform architecture is designed, comprising a data service layer for collection and storage, a business logic layer for analysis, and a Web presentation layer for visualization. A shared knowledge base links offline analysis with real-time analysis, and Hadoop, Storm, Flume, HBase, and Kafka are integrated to build the distributed cluster environment that realizes the architecture.

(2) Distributed collection and storage of log data. Flume ships log data gathered from multiple front-end servers into the distributed database HBase in near real time, and HBase is optimized with region pre-splitting and randomized RowKey salting (see the HBase sketch after this abstract). Experiments show that the platform collects and stores front-end log data in near real time; the optimized HBase makes fuller use of the cluster's I/O and CPU resources during log storage, balances the load better, and effectively eliminates HBase "hot spot" regions.

(3) Offline deep analysis of log data with MapReduce. Traditional data mining algorithms are parallelized under the MapReduce computing model and ported to the platform, extracting global knowledge from the historical log data in HBase and storing it in the offline knowledge base. For the target application, K-means and Apriori are parallelized to perform clustering and association rule analysis in the MapReduce distributed environment (see the K-means sketch after this abstract). Experiments show that the platform extracts highly reliable knowledge from historical logs, and MapReduce parallelization gives the deep analysis better efficiency and scalability, fully meeting the needs of large-scale log knowledge extraction.

(4) Real-time analysis of log streams with Storm. Storm and Kafka are integrated to give real-time computation a stable log stream source. Traditional data mining algorithms are combined with the Storm model to extract burst knowledge from small-scale real-time data within a given time window and store it in the real-time knowledge base; the information in the shared knowledge base then serves as decision support for Storm's continuous stream computation over the log stream, combining offline with real-time computation. For the target application, K-means, KNN, and other algorithms are combined to recognize network anomalies (see the topology sketch after this abstract). Experiments show that the platform extracts burst knowledge from real-time data and, relying on the shared knowledge base, performs highly accurate continuous real-time computation; Storm gives the analysis lower latency and shows a clear advantage in stream processing.

In summary, the real-time log processing platform built in this study effectively solves the problems of log data collection, storage, and knowledge extraction by combining the strengths of Hadoop and Storm: MapReduce extracts the global knowledge hidden in historical logs while Storm extracts the burst knowledge in small-scale real-time logs, and the two kinds of knowledge together drive Storm's stream processing for continuous real-time computation over the log stream. The platform offers a new technical reference for log data collection, storage, and analysis and has practical and promotional value.
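To make the HBase optimization in (2) concrete, the following is a minimal Java sketch of region pre-splitting plus RowKey salting, assuming the HBase 1.x client API. The table name `weblog`, column family `info`, 16-region split count, and `host_timestamp` key layout are illustrative assumptions rather than values from the thesis; only the technique itself (hash-prefix salting against hot regions) follows the text.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedLogTable {
    static final int REGION_COUNT = 16; // hypothetical number of pre-split regions

    // Pre-split the log table so each salt prefix ("00".."15") maps to its own region.
    public static void createPreSplitTable(Configuration conf) throws IOException {
        byte[][] splitKeys = new byte[REGION_COUNT - 1][];
        for (int i = 1; i < REGION_COUNT; i++) {
            splitKeys[i - 1] = Bytes.toBytes(String.format("%02d", i));
        }
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("weblog"));
            desc.addFamily(new HColumnDescriptor("info"));
            admin.createTable(desc, splitKeys);
        }
    }

    // Salt the natural key (e.g. "host_timestamp") with a stable hash prefix so that
    // time-ordered writes spread across all regions instead of hammering one hot region.
    public static byte[] saltedRowKey(String naturalKey) {
        int salt = Math.abs(naturalKey.hashCode()) % REGION_COUNT;
        return Bytes.toBytes(String.format("%02d", salt) + "_" + naturalKey);
    }

    public static void main(String[] args) throws IOException {
        createPreSplitTable(HBaseConfiguration.create());
        System.out.println(Bytes.toString(saltedRowKey("web01_20170301123000")));
    }
}
```

The trade-off of salting is that range scans over the natural key must fan out to all salt prefixes, which is acceptable here because log analysis on the platform is driven by MapReduce full scans rather than narrow range reads.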
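The parallel K-means of (3) can be sketched as a single MapReduce iteration that a driver reruns until the centroids stabilize: the mapper assigns each point to its nearest centroid, and the reducer recomputes each centroid as the mean of its assigned points. The two-dimensional comma-separated input format and the `kmeans.centroids` configuration key are hypothetical choices made for illustration; the thesis does not specify its feature encoding.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// One K-means iteration; a driver loop would rerun it until centroids converge.
public class KMeansIteration {

    public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context ctx) {
            // Current centroids passed in as "x1,y1;x2,y2;..." (illustrative encoding).
            for (String c : ctx.getConfiguration().get("kmeans.centroids").split(";")) {
                String[] p = c.split(",");
                centroids.add(new double[]{Double.parseDouble(p[0]), Double.parseDouble(p[1])});
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] p = value.toString().split(",");
            double x = Double.parseDouble(p[0]), y = Double.parseDouble(p[1]);
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
                double[] c = centroids.get(i);
                double d = (x - c[0]) * (x - c[0]) + (y - c[1]) * (y - c[1]);
                if (d < bestDist) { bestDist = d; best = i; }
            }
            ctx.write(new IntWritable(best), value); // point joins its nearest cluster
        }
    }

    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
                throws IOException, InterruptedException {
            double sx = 0, sy = 0;
            long n = 0;
            for (Text t : points) {
                String[] p = t.toString().split(",");
                sx += Double.parseDouble(p[0]);
                sy += Double.parseDouble(p[1]);
                n++;
            }
            // New centroid = mean of the assigned points.
            ctx.write(cluster, new Text((sx / n) + "," + (sy / n)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("kmeans.centroids", args[2]); // previous iteration's centroids
        Job job = Job.getInstance(conf, "kmeans-iteration");
        job.setJarByClass(KMeansIteration.class);
        job.setMapperClass(AssignMapper.class);
        job.setReducerClass(RecomputeReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```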
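For the Storm/Kafka integration and anomaly recognition of (4), a minimal topology might look as follows, assuming the ZooKeeper-based storm-kafka spout of Storm 1.x. The topic name, ZooKeeper address, two-dimensional features, and hardcoded reference vectors are placeholders: the thesis loads its labeled knowledge from the shared knowledge base and mixes K-means with KNN, whereas this sketch reduces classification to 1-NN for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class LogAnomalyTopology {

    // Labels each log record by its nearest labeled reference vector (1-NN here for
    // brevity); in the thesis these references would come from the shared knowledge base.
    public static class KnnBolt extends BaseBasicBolt {
        private final List<double[]> refs = new ArrayList<>();   // feature vectors
        private final List<String> labels = new ArrayList<>();

        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            refs.add(new double[]{0.1, 0.2}); labels.add("normal");   // illustrative entries
            refs.add(new double[]{5.0, 9.0}); labels.add("anomaly");
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String record = input.getStringByField("str"); // field declared by StringScheme
            String[] f = record.split(",");
            double x = Double.parseDouble(f[0]), y = Double.parseDouble(f[1]);
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < refs.size(); i++) {
                double d = Math.pow(x - refs.get(i)[0], 2) + Math.pow(y - refs.get(i)[1], 2);
                if (d < bestDist) { bestDist = d; best = i; }
            }
            collector.emit(new Values(record, labels.get(best)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("record", "label"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Kafka buffers the Flume-collected log stream; the spout replays it into Storm.
        SpoutConfig spoutConf = new SpoutConfig(
                new ZkHosts("localhost:2181"), "weblog", "/kafka-spout", "log-reader");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 2);
        builder.setBolt("knn-bolt", new KnnBolt(), 4).shuffleGrouping("kafka-spout");

        new LocalCluster().submitTopology("log-anomaly", new Config(), builder.createTopology());
    }
}
```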
【Degree-granting institution】: Southwest University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP311.13