Research and Implementation of an Evidence Preservation System Based on Hadoop
Published: 2018-06-08 15:57
Topic: Cloud services + Hadoop; Source: Master's thesis, University of Electronic Science and Technology of China, 2014
【Abstract】: With the rapid development of the Internet and the mobile Internet, data volumes have been growing exponentially. Facing the challenges posed by massive data, major Internet companies at home and abroad have applied the concept of cloud computing to commercial services and launched cloud services of their own. Cloud services deliver computing resources and business applications to users over the Internet, moving data processing from personal computers or servers to Internet data centers and thereby reducing users' investment in hardware, software, and specialized skills. Cloud services are now widely used across business scenarios and have matured into an established commercial service model. Based on Hadoop, this thesis completes the following work.

1. Design and implementation of an evidence preservation system for cloud services. The system works as follows. First, a gateway server is deployed between the cloud service provider and its users; according to filtering conditions specified by the provider, it captures every user HTTP request to the designated cloud-service APIs and extracts user feature information. This feature information mainly comprises: the user name, the time the request was initiated, the user's region, the cloud-service API requested, and the API's parameters. The gateway then imports the feature information into a data analysis system, which analyzes it according to analysis conditions specified by the provider and presents the results to the provider as reports.
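The gateway's feature-extraction step described above can be illustrated with a minimal sketch. All names here (header fields, the `region_lookup` helper, the example URL) are illustrative assumptions, not details from the thesis; a real gateway would obtain the user identity from its authentication layer and the region from a GeoIP database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

@dataclass
class UserFeature:
    user: str            # user name
    time: str            # time the request was initiated (ISO 8601)
    region: str          # user's region
    api: str             # cloud-service API path requested
    params: dict = field(default_factory=dict)  # API parameters

def extract_feature(url, headers, region_lookup):
    """Build the user-feature record from one intercepted HTTP request.

    `region_lookup` maps a client address to a region string; this is a
    stand-in for a real GeoIP lookup (assumption, not from the thesis).
    """
    parsed = urlparse(url)
    return UserFeature(
        user=headers.get("X-User-Name", "anonymous"),
        time=datetime.now(timezone.utc).isoformat(),
        region=region_lookup(headers.get("X-Forwarded-For", "")),
        api=parsed.path,
        params=parse_qs(parsed.query),
    )

# Example: one intercepted request to a (hypothetical) storage API.
feat = extract_feature(
    "https://cloud.example.com/api/v1/storage/list?bucket=logs&limit=10",
    {"X-User-Name": "alice", "X-Forwarded-For": "203.0.113.7"},
    region_lookup=lambda ip: "CN-Sichuan" if ip.startswith("203.") else "unknown",
)
print(feat.user, feat.api, feat.params)
```

In a deployment, the gateway would stream such records to the analysis system rather than print them; the record's five fields match the feature list given in the abstract.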
Finally, the data analysis system archives the user feature information into the storage system for permanent preservation, according to archiving conditions specified by the provider. Given the provider's huge user base, the volume of data the evidence preservation system must handle is expected to remain at the PB level, so the cloud computing platform Hadoop is adopted as the underlying implementation of both the data analysis system and the storage system.

2. Optimization of small-file storage in HDFS. The evidence preservation system periodically archives user feature information into HDFS (the Hadoop Distributed File System) according to a variety of archiving conditions. Archiving partitions the feature information into a large number of files, among which there are both GB-scale large files and many KB-scale small files. HDFS, however, is designed for large-file storage, and storing large numbers of small files degrades the overall performance of an HDFS cluster. This thesis therefore studies the Hadoop source code to analyze why storing many small files degrades HDFS performance, and on that basis proposes an HDFS client-side aggregation-and-index strategy: small files are aggregated on the client side and an index is built over them, optimizing HDFS small-file storage.
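The idea behind client-side aggregation with an index can be sketched in a few lines: pack many small files into one container blob and keep an (offset, length) index per file, so that HDFS stores one large file plus a small index instead of thousands of tiny files. This is an illustrative, in-memory sketch under stated assumptions; the thesis's actual implementation works against the HDFS client API and persists the index alongside the container file.

```python
import io

class AggregatedWriter:
    """Aggregate small files into one container blob with an index of
    (offset, length) entries, keyed by file name. Each small file stored
    directly in HDFS costs a NameNode metadata entry (roughly 150 bytes
    of NameNode memory per object), so aggregation cuts the metadata
    load that degrades cluster performance."""

    def __init__(self):
        self.blob = io.BytesIO()   # container that would become one HDFS file
        self.index = {}            # name -> (offset, length)

    def add(self, name, data: bytes):
        """Append one small file to the container and record its index entry."""
        offset = self.blob.tell()
        self.blob.write(data)
        self.index[name] = (offset, len(data))

    def read(self, name) -> bytes:
        """Random access to one small file via the index, without scanning."""
        offset, length = self.index[name]
        return self.blob.getvalue()[offset:offset + length]

# Aggregate two KB-scale records that would otherwise be two separate
# HDFS files, each consuming its own NameNode metadata entry.
w = AggregatedWriter()
w.add("user_alice.log", b"alice,/api/v1/storage/list")
w.add("user_bob.log", b"bob,/api/v1/compute/run")
print(len(w.index), w.read("user_bob.log"))
```

The design point is that reads stay cheap: the index gives the exact byte range of each aggregated file, so retrieving one record needs a single seek into the container rather than opening a separate HDFS file.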
【Degree-granting institution】: University of Electronic Science and Technology of China
【Degree level】: Master's
【Year conferred】: 2014
【CLC classification】: TP393.09; TP311.13
Article ID: 1996350
Link: https://www.wllwen.com/guanlilunwen/ydhl/1996350.html