Coding-Based Fault-Tolerant Cloud Storage Systems and Performance Optimization

Published: 2018-10-04 19:54

[Abstract]: In the era of cloud computing, mass data storage and data analysis have become an arena in which the IT industry's giants compete, and distributed file systems, as one piece of cloud computing infrastructure, have received wide attention in both research and application. Multi-replica redundancy is currently the most popular fault-tolerance mechanism in distributed file systems. It performs well for large-scale parallel computing, but it falls far behind erasure coding in controlling redundancy overhead and in raising the system's fault-tolerance level, where erasure codes have a clear advantage. Designing and implementing a distributed file system with coding-based fault tolerance, one that supports different coding schemes and fault-tolerance mechanisms so that users and application developers can choose the strategy best suited to their situation, is therefore of great significance for reducing storage volume and improving system efficiency in an age of explosive data growth. To study the performance of coding-based fault-tolerant cloud storage, this thesis builds on the open-source cloud storage platform HDFS and, by incorporating information-theoretic coding techniques, designs and implements a coding-based fault-tolerant cloud storage system. The system supports different fault-tolerance strategies and coding schemes and as many basic file operations as possible to meet user needs. The thesis also studies file segmentation schemes and establishes an optimization mechanism with controllable granularity, which greatly improves the performance of random reads and file appends. The work and results of this thesis are as follows.
1. A distributed file system that is generic over coding schemes is designed and implemented. Different fault-tolerance mechanisms have their own advantages on different metrics, different coding schemes differ in encoding and decoding efficiency and in redundancy overhead, and different applications have different storage requirements. This thesis therefore designs and implements a distributed file system with general coding-based fault tolerance for large-scale data applications. The system runs on ordinary commodity storage servers and has good fault tolerance. It treats multi-replica fault tolerance as a special case of coding and supports a variety of coding schemes, so that users and applications can choose the scheme best suited to their storage requirements.
2. A fine-grained file segmentation scheme based on the transfer unit is designed. Systems such as GFS, which uses multi-replica fault tolerance, and HDFS Raid, which uses erasure codes, adopt a coarse-grained file segmentation scheme based on the file storage unit. In a coded file system this scheme makes many basic file operations, such as random reads and file appends, inefficient. In response, this thesis designs and implements a fine-grained file segmentation scheme based on the file transfer unit. Experimental data and theoretical analysis show that, across the basic file operations, this scheme performs no worse than the coarse-grained scheme.
3. On top of the fine-grained segmentation scheme, efficient random reads and file appends are implemented. In big-data scenarios, many distributed file systems are designed on the assumption that a data file is written once and then read sequentially many times. This thesis observes that random reads and file appends are also the basis of many applications, and that the fine-grained segmentation scheme can support both operations efficiently. The last part of the thesis therefore implements efficient random reads and file appends and analyzes their efficiency in comparison with the coarse-grained scheme.
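The two ideas in the abstract, that replication can be viewed as a special case of coding and that the segmentation unit controls granularity, can be illustrated with a minimal sketch. It is not the thesis's implementation. The names `CodeParams`, `locate`, and `stripe_data_bytes` are hypothetical, the code is assumed to be MDS (any k of the n units in a stripe suffice to recover the data), the round-robin stripe layout is a generic one, and the 64 MiB storage-unit and 64 KiB transfer-unit sizes are illustrative assumptions rather than values taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeParams:
    """Hypothetical (n, k) code: k data units plus n - k redundant units per stripe."""
    n: int
    k: int

    def storage_overhead(self) -> float:
        # Bytes stored per byte of user data.
        return self.n / self.k

    def tolerated_failures(self) -> int:
        # Assumes an MDS code: any k of the n units recover the stripe.
        return self.n - self.k


def locate(offset: int, unit_size: int, k: int) -> tuple[int, int, int]:
    """Map a logical byte offset to (stripe, data column, offset inside the unit).

    Assumes data units are laid out round-robin across the k data columns.
    """
    unit_index = offset // unit_size
    return unit_index // k, unit_index % k, offset % unit_size


def stripe_data_bytes(unit_size: int, k: int) -> int:
    """User data covered by one stripe, i.e. the span one encoding step works on."""
    return unit_size * k


if __name__ == "__main__":
    MIB, KIB = 1024 * 1024, 1024
    three_way_replication = CodeParams(n=3, k=1)  # replication as the k = 1 case
    wide_code = CodeParams(n=14, k=10)            # illustrative parameters
    for name, code in [("3 replicas", three_way_replication), ("(14, 10)", wide_code)]:
        print(name, code.storage_overhead(), code.tolerated_failures())

    # The same logical offset under an assumed coarse and an assumed fine unit.
    print(locate(200_000, 64 * MIB, 4), stripe_data_bytes(64 * MIB, 4))
    print(locate(200_000, 64 * KIB, 4), stripe_data_bytes(64 * KIB, 4))
```

Under these assumptions, a stripe built from storage-sized units spans hundreds of megabytes of user data, while one built from transfer-sized units spans a few hundred kilobytes. That difference in span is one plausible reason a smaller unit helps operations that touch only a small part of a file; the thesis's own analysis may rest on different details.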
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP333

[Similar literature]