Research on Building Efficient Data Deduplication Systems for Data Backup
Published: 2018-05-19 21:42
Topics: data deduplication + chunk fragmentation; Source: PhD dissertation, Huazhong University of Science and Technology, 2016
【Abstract】: In the era of big data, storing and managing massive volumes of data efficiently has become a major challenge for storage-system researchers and practitioners. Numerous studies show that redundant data is ubiquitous in all kinds of storage systems, such as backup storage systems and desktop file systems, and that eliminating this redundancy can save substantial storage cost. Against this background, data deduplication, an efficient compression technique, has gradually been adopted in a variety of storage systems. Building an efficient deduplication system, however, still faces many problems and challenges, such as chunk fragmentation, large-scale fingerprint indexing, and storage reliability. This dissertation first addresses the chunk fragmentation caused by deduplication in backup storage systems, then studies the impact of deduplication on storage reliability, and finally discusses systematically how to design an efficient deduplication system for backup workloads.

Chunk fragmentation caused by deduplication severely degrades the restore performance of backup streams and, after users delete backups, reduces the efficiency of garbage collection. An analysis of long-term backup datasets shows that fragmentation mainly comes from two kinds of containers: sparse containers and out-of-order containers. Sparse containers directly amplify read operations, whereas out-of-order containers hurt restore performance only when the restore cache is insufficient, so the two require different solutions. Existing buffer-based rewriting algorithms cannot accurately distinguish sparse containers from out-of-order containers, which leads to low storage efficiency and poor restore performance. This dissertation proposes a history-aware approach to chunk fragmentation, comprising the history-aware rewriting algorithm (HAR), the optimal restore cache algorithm (OPT), the cache-aware filter (CAF), and the container-marker algorithm (CMA). The approach rewrites sparse containers and relies on the restore cache to mitigate the impact of out-of-order containers, thereby reducing storage overhead. HAR exploits the similarity between consecutive backup streams to identify sparse containers accurately, which is the key to distinguishing sparse containers from out-of-order ones. OPT records the access order of chunks at backup time and uses it to implement Belady's optimal cache replacement, reducing the impact of out-of-order containers on restore performance. To further reduce the demand on the restore cache, CAF simulates the restore cache to accurately identify and rewrite the small number of out-of-order containers that actually degrade restore performance. To reduce the time and space overhead of garbage collection, CMA leverages HAR to reclaim sparse containers. Under a first-in-first-out backup deletion policy, CMA reclaims a large amount of storage space without time-consuming container merging, and because it tracks container utilization directly, its overhead is proportional to the number of containers rather than the number of chunks. Experiments on four long-term backup datasets show that the history-aware approach achieves lower storage cost and better restore performance than existing algorithms.

The impact of deduplication on storage reliability has long been unknown. By eliminating redundant data, deduplication reduces the number of disks required and therefore the probability of encountering disk errors; at the same time, it increases the severity of each disk error, because losing a single chunk may corrupt multiple files. This dissertation proposes a quantitative methodology for analyzing the reliability of deduplication systems: it introduces the notion of logical data loss to extend the existing reliability metric NOMDL so that it can measure the reliability of deduplicated storage. A reliability simulator for deduplication systems, SIMD, is designed. SIMD uses statistics published by industry to simulate sector errors and whole-disk failures and generates the various data-loss events of a disk array. To compute the amount of logical data lost in each event, SIMD builds chunk-level and file-level models from real file-system images. Analysis and simulation of 18 real file-system images show that, owing to intra-file redundancy, deduplication significantly reduces the number of files corrupted by sector errors; however, the chunk fragmentation introduced by deduplication amplifies the damage of whole-disk failures. To improve storage reliability, this dissertation proposes the DCT replica technique. DCT allocates 1% of the physical space of the disk array to copies of highly referenced chunks and repairs these copies first during array reconstruction. With very small storage overhead, DCT reduces the chunks and files lost to whole-disk failures by 31.8% and 25.8%, respectively.

Building an efficient deduplication system also requires considering the impact of other components, such as the fingerprint index. To understand and compare existing designs systematically and to propose new, more efficient ones, a general-purpose deduplication prototype, Destor, is designed and implemented. Destor models a deduplication system as a multi-dimensional parameter space in which each dimension corresponds to a subsystem or parameter, including chunking, fingerprint indexing, rewriting algorithms, and restore algorithms. Each parameter has several candidate designs, and both existing designs and potential new ones are regarded as points in this space. Destor implements the parameter space and covers the designs of many mainstream deduplication systems; researchers can use it to compare existing designs and to explore the space for potential new ones. To discover more efficient designs, the parameter space is explored with three long-term backup datasets, focusing on four metrics: memory footprint, storage cost, backup performance, and restore performance. A target design must sustain stably high backup performance over the long term and strike a reasonable trade-off among the other three metrics. Seventeen experimental findings are obtained, and the following designs are recommended: when the lowest storage cost is required, exact deduplication exploiting logical locality should be used; when the lowest memory footprint is required, near-exact deduplication exploiting either logical or physical locality can be used; when stably high restore performance is required, exact deduplication exploiting physical locality should be used together with the history-aware fragmentation-reduction approach. When higher reliability is required, any of these designs can additionally adopt the DCT replica technique, which adds only negligible storage cost without affecting backup or restore performance.
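To make the rewriting idea behind HAR concrete, the following is a minimal illustrative sketch, not the dissertation's implementation: containers whose utilization by the previous backup fell below a threshold are remembered, and duplicate chunks that deduplicate against those containers are rewritten in the current backup. The container size, the 50% threshold, and the data structures are assumptions for illustration only.

```python
# Illustrative sketch of a HAR-style rewriting decision (not the dissertation's code).
# Assumption: the fingerprint index maps a chunk fingerprint to the ID of the
# container that stores the chunk; container and chunk sizes are known.

CONTAINER_SIZE = 4 * 1024 * 1024   # assumed fixed container size (4 MiB)
UTILIZATION_THRESHOLD = 0.5        # assumed threshold below which a container is "sparse"

def identify_sparse_containers(prev_backup_chunks, fingerprint_index):
    """After a backup finishes, compute each referenced container's utilization
    and remember the sparse ones for the *next* backup (history awareness)."""
    referenced_bytes = {}  # container_id -> bytes of it referenced by this backup
    for fp, size in prev_backup_chunks:
        cid = fingerprint_index.get(fp)
        if cid is not None:
            referenced_bytes[cid] = referenced_bytes.get(cid, 0) + size
    return {cid for cid, used in referenced_bytes.items()
            if used / CONTAINER_SIZE < UTILIZATION_THRESHOLD}

def should_rewrite(fp, fingerprint_index, sparse_containers):
    """During the current backup: rewrite a duplicate chunk if it lives in a
    container that the previous backup identified as sparse."""
    cid = fingerprint_index.get(fp)
    if cid is None:
        return True                   # new chunk: must be written anyway
    return cid in sparse_containers   # duplicate: rewrite only if its container is sparse
```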
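The OPT component relies on the classic Belady replacement policy, which is well defined independently of the dissertation. The sketch below simulates such an optimal restore cache, assuming the container access sequence recorded at backup time is available at restore time; when the cache is full it evicts the container whose next use lies farthest in the future. Function and variable names are illustrative.

```python
# Illustrative Belady-style optimal restore cache (assumes cache_slots >= 1).

def simulate_optimal_restore_cache(access_seq, cache_slots):
    """Return the number of container reads under Belady's optimal replacement,
    given the container access order recorded during backup."""
    # Precompute, for each position, the next position where the same container
    # is accessed again (infinity if never).
    next_use = [float("inf")] * len(access_seq)
    last_seen = {}
    for i in range(len(access_seq) - 1, -1, -1):
        cid = access_seq[i]
        next_use[i] = last_seen.get(cid, float("inf"))
        last_seen[cid] = i

    cache = {}   # container_id -> position at which it is needed next
    reads = 0
    for i, cid in enumerate(access_seq):
        if cid in cache:
            cache[cid] = next_use[i]          # hit: refresh its next-use position
            continue
        reads += 1                            # miss: container must be read from disk
        if cache and len(cache) >= cache_slots:
            victim = max(cache, key=cache.get)  # farthest next use is evicted
            del cache[victim]
        cache[cid] = next_use[i]
    return reads
```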
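The claim that CMA's overhead grows with the number of containers rather than chunks can be illustrated with per-container reference counts under a FIFO deletion policy. This is a hedged sketch under those assumptions, not the dissertation's algorithm: each backup records the set of containers it references, and a container becomes reclaimable once no remaining backup references it.

```python
from collections import deque

class ContainerMarker:
    """Illustrative CMA-like bookkeeping: one counter per container, FIFO deletion."""

    def __init__(self):
        self.backups = deque()   # each entry: set of container IDs used by a backup
        self.refs = {}           # container_id -> number of live backups referencing it

    def add_backup(self, container_ids):
        used = set(container_ids)
        self.backups.append(used)
        for cid in used:
            self.refs[cid] = self.refs.get(cid, 0) + 1

    def delete_oldest_backup(self):
        """Delete the oldest backup; return containers that can now be reclaimed."""
        used = self.backups.popleft()
        reclaimable = []
        for cid in used:
            self.refs[cid] -= 1
            if self.refs[cid] == 0:
                del self.refs[cid]
                reclaimable.append(cid)   # no live backup references this container
        return reclaimable
```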
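As a hedged illustration of the "logical data loss" idea used to extend NOMDL, one plausible accounting, assumed here rather than quoted from the dissertation, weights every lost physical chunk by its deduplication reference count, so that a corrupted chunk shared by many files counts for all the logical bytes it backs.

```python
def logical_data_loss(lost_chunks, chunk_size, chunk_refcount):
    """Illustrative accounting: a lost physical chunk costs size * reference_count
    logical bytes, since every deduplicated copy that points to it is corrupted."""
    return sum(chunk_size[c] * chunk_refcount[c] for c in lost_chunks)
```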
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Doctoral
【Year of degree conferral】: 2016
【CLC number】: TP333
Article ID: 1911829
Link: https://www.wllwen.com/shoufeilunwen/xxkjbs/1911829.html