Research on Key Technologies of Data Deduplication for Cloud Environments

Published: 2018-08-25 15:01

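[Abstract]: With the arrival of the big data era, the volume of data in the information world is growing explosively, and the storage and management demands of data centers have reached the PB or even EB scale. Studies have found that large amounts of duplicate data exist in increasingly complex massive datasets, whether in the backup and archival storage tiers or in the conventional primary storage tier. Traditional data backup techniques and virtual machine image storage management methods further accelerate the growth of duplicate data. To curb excessive data growth, improve IT resource utilization, and reduce system energy consumption and management costs, data deduplication, an emerging data reduction technology, has become a research focus in both academia and industry. Cloud computing, a key enabling technology for big data, optimizes resource utilization through network computing and virtualization, providing users with inexpensive, efficient, and reliable computing and storage services. For cloud backup and virtual desktop cloud environments that contain large amounts of redundant data, deduplication can greatly reduce storage space requirements and improve network bandwidth utilization, but it also poses system performance challenges. This thesis mainly discusses how to use deduplication to optimize cloud backup services for personal computing environments, distributed cloud backup storage systems in data centers, and virtual desktop cloud cluster storage systems, so as to improve IT resource utilization and system scalability and to reduce the impact of deduplication operations on I/O performance. Building on a comprehensive survey of the current state of cloud computing technology, this thesis analyzes and studies deduplication-based cloud backup, big data backup, and virtual desktop cloud applications in depth, and proposes new system designs and algorithms. The main work and contributions are as follows:
(1) ALG-Dedupe, a tiered application-aware source deduplication mechanism for cloud backup services in personal computing environments, is proposed. Through statistical analysis of a large amount of personal application data, this thesis finds for the first time that the amount of data shared between datasets of different application types is negligible. File semantics are used to guide the classification of application data, and an application-aware index structure is designed that allows deduplication to proceed independently and in parallel within each class of application data, and that adaptively selects the data partitioning strategy and fingerprint function according to the characteristics of each class. Because the two source deduplication strategies, local redundancy detection at the client and remote redundancy detection at the cloud data center, are complementary in response latency and system overhead, application-aware source deduplication is split into two tiers, local deduplication at the client and global deduplication in the cloud, to further improve the data reduction ratio and shorten deduplication processing time. Experiments show that ALG-Dedupe greatly improves deduplication efficiency while effectively shortening the backup window, lowering cloud storage cost, and reducing the energy consumption and system overhead of personal computing devices.
(2) E-Dedupe, a scalable cluster deduplication method that supports big data backup in cloud data centers, is designed. Its novelty lies in exploiting both data locality and data similarity to optimize cluster deduplication. E-Dedupe combines superblock-level data routing across cluster nodes with block-level deduplication within each node, improving the data reduction ratio while preserving the locality of data access. By extending Broder's min-wise independent permutations theory, it is the first to propose the handprint technique to improve the detection of superblock similarity. By weighting similarity with each node's storage space utilization, a handprint-based stateful superblock routing algorithm is designed that distributes data at superblock granularity from backup clients to the deduplication server nodes. A similarity index is built from the representative data block fingerprints in the superblock handprints and is combined with a container management mechanism and a data block fingerprint caching policy to optimize fingerprint lookup performance. By adopting source inline deduplication, the backup client can avoid transferring the duplicate data blocks of a superblock to the target routing node. Extensive experiments show that E-Dedupe achieves a high cluster-wide data reduction ratio while effectively reducing system communication overhead and memory overhead and keeping the load balanced across nodes.
(3) A virtual desktop cloud storage optimization technique based on cluster deduplication is proposed. To support scalable virtual desktop cloud services, a virtual desktop server cluster must manage a large number of desktop virtual machines. By exploiting the semantic information of virtual machine image files, this thesis proposes for the first time a semantic-aware virtual machine scheduling algorithm to support deduplication-based virtual desktop cluster storage systems. In addition, combining the servers' data block cache with a local hybrid storage cache, a deduplication-based I/O optimization strategy for virtual desktop storage is designed. Experimental analysis shows that the deduplication-based virtual desktop cluster storage optimization technique effectively improves the space utilization of virtual desktop storage, reduces the number of I/O operations on the storage system, and speeds up virtual desktop boot.
Through the above research on key deduplication technologies in cloud environments, we provide strong technical support for future research on cloud storage and cloud computing.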
[Abstract]: With the advent of the big data era, the amount of data in the information world is growing explosively, and the data storage and management requirements of data centers have reached the PB or even EB scale. To curb excessive data growth, improve IT resource utilization, and reduce system energy consumption and management costs, data deduplication, an emerging data reduction technology, has become a research focus in academia and industry.
Cloud computing, a key enabling technology for big data, optimizes resource utilization through network computing and virtualization to provide users with inexpensive, efficient, and reliable computing and storage services. This thesis mainly discusses how to use data deduplication to optimize cloud backup services for personal computing environments, distributed cloud backup storage systems in data centers, and virtual desktop cloud cluster storage systems, so as to improve IT resource utilization and system scalability and to reduce the impact of deduplication operations on I/O performance. It analyzes and studies deduplication-based cloud backup, big data backup, and virtual desktop cloud applications, and proposes new system designs and algorithms.
(1) ALG-Dedupe, a tiered application-aware source deduplication mechanism for cloud backup services in personal computing environments, is proposed. Statistical analysis of a large amount of personal application data shows, for the first time, that the amount of data shared between datasets of different application types is negligible. An application-aware index structure is designed that allows each class of application data to be deduplicated independently and in parallel, and that adaptively selects the data partitioning strategy and fingerprint function according to the characteristics of each class. Application-aware source deduplication is divided into two tiers, local deduplication at the client and global deduplication in the cloud, to further improve the data reduction ratio and shorten deduplication processing time. ALG-Dedupe effectively shortens the backup window, lowers cloud storage cost, and reduces the energy consumption and system overhead of personal computing devices.
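The two-tier, application-aware idea in (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the extension-to-class table `APP_CLASS`, the per-class `POLICY` values (fixed-size splitting, SHA-1 or SHA-256), and the class `AppAwareClient` are all hypothetical names and settings chosen only for illustration. What the sketch does take from the abstract is the structure: one fingerprint index per application class, a policy chosen per class, and a local lookup at the client followed by a global lookup in the cloud.

```python
import hashlib
import os
from collections import defaultdict

# Hypothetical mapping from file extension to application class.
APP_CLASS = {".txt": "text", ".doc": "text", ".jpg": "compressed", ".mp3": "compressed"}

# Hypothetical per-class policy: (block size in bytes, or None for whole file; hash name).
POLICY = {
    "compressed": (None, "sha1"),
    "text": (4096, "sha256"),
    "other": (8192, "sha256"),
}


def classify(filename):
    """Pick an application class from the file extension (the 'file semantics')."""
    ext = os.path.splitext(filename)[1].lower()
    return APP_CLASS.get(ext, "other")


def split(data, size):
    """Yield the whole file when size is None, otherwise fixed-size blocks."""
    if size is None:
        yield data
        return
    for i in range(0, len(data), size):
        yield data[i:i + size]


class AppAwareClient:
    """Backup client with one local fingerprint index per application class."""

    def __init__(self):
        self.local = defaultdict(set)

    def dedupe(self, filename, data, cloud_index):
        """Return the blocks that must be uploaded after local and global checks."""
        app = classify(filename)
        size, algo = POLICY[app]
        to_upload = []
        for block in split(data, size):
            fp = hashlib.new(algo, block).hexdigest()
            if fp in self.local[app]:      # tier 1: local check at the client
                continue
            self.local[app].add(fp)
            if fp in cloud_index[app]:     # tier 2: global check in the cloud
                continue
            cloud_index[app].add(fp)
            to_upload.append(block)
        return to_upload
```

Because each class has its own index, lookups for different classes never touch the same structure, which is what would allow them to run in parallel; the sketch itself is sequential for brevity.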
(2) E-Dedupe, a scalable cluster deduplication method, is designed to support big data backup in cloud data centers. Its novelty lies in exploiting both data locality and data similarity to optimize cluster deduplication. E-Dedupe combines superblock-level data routing across cluster nodes with block-level deduplication within each node, which improves the data reduction ratio while preserving the locality of data access. By extending Broder's min-wise independent permutations theory, the handprint technique is proposed for the first time to improve the detection of superblock similarity. By weighting similarity with each node's storage space utilization, a handprint-based stateful superblock routing algorithm is designed that assigns data at superblock granularity from the backup client to the deduplication server nodes. A similarity index is built from the representative data block fingerprints in the superblock handprints, and is combined with a container management mechanism and a data block fingerprint caching policy to optimize fingerprint lookup performance. With source inline deduplication, the backup client can avoid transferring the duplicate data blocks of a superblock to the target routing node. Extensive experiments show that E-Dedupe achieves a high cluster-wide data reduction ratio, effectively reduces system communication overhead and memory overhead, and keeps the load balanced across nodes.
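A rough Python sketch of the routing idea in (2) follows. It assumes that a handprint is the k smallest data block fingerprints of a superblock (a min-wise style sample), and that a node's score is its handprint overlap divided by its storage usage relative to the cluster average. The exact weighting, the value of `k`, and the names `Node`, `handprint`, and `route` are illustrative assumptions, not details given in the abstract.

```python
import hashlib


def fingerprint(block):
    """SHA-1 fingerprint of one data block (the hash choice is an assumption)."""
    return hashlib.sha1(block).hexdigest()


def handprint(block_fps, k=8):
    """Representative fingerprints: the k smallest distinct block fingerprints."""
    return sorted(set(block_fps))[:k]


class Node:
    """A deduplication server node with a similarity index and a block store."""

    def __init__(self, name):
        self.name = name
        self.similarity_index = set()  # representative fingerprints seen so far
        self.stored = set()            # fingerprints of all stored blocks
        self.used = 0                  # bytes stored after deduplication


def route(superblock, nodes, k=8):
    """Send a superblock (list of blocks) to the best node; return (node, new blocks)."""
    fps = [fingerprint(b) for b in superblock]
    hp = handprint(fps, k)
    avg_used = sum(n.used for n in nodes) / len(nodes)

    def score(n):
        overlap = len(n.similarity_index.intersection(hp))
        # Smoothed relative usage; the +1 avoids division by zero on empty nodes.
        relative_usage = (n.used + 1) / (avg_used + 1)
        # Ties on the weighted overlap are broken in favour of the emptier node.
        return (overlap / relative_usage, -n.used)

    target = max(nodes, key=score)
    new_blocks = 0
    for fp, block in zip(fps, superblock):
        if fp not in target.stored:    # only unseen blocks are stored or sent
            target.stored.add(fp)
            target.used += len(block)
            new_blocks += 1
    target.similarity_index.update(hp)
    return target, new_blocks
```

In this sketch only `k` fingerprints per superblock are compared against each node, which is the reason such a scheme can keep routing traffic small; a superblock with no overlap anywhere simply goes to the least loaded node.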
(3) A virtual desktop cloud storage optimization technique based on cluster deduplication is proposed. To support scalable virtual desktop cloud services, virtual desktop server clusters need to manage a large number of desktop virtual machines. By exploiting the semantic information of virtual machine image files, this thesis proposes for the first time a semantic-aware virtual machine scheduling algorithm to support deduplication-based virtual desktop cluster storage systems. In addition, a deduplication-based I/O optimization strategy for virtual desktop storage is designed, which combines the servers' data block cache with a local hybrid storage cache. Experimental results show that the deduplication-based virtual desktop cluster storage optimization technique effectively improves the space utilization of virtual desktop storage, reduces the number of I/O operations on the storage system, and speeds up virtual desktop boot.
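The abstract does not describe how the semantic-aware scheduling in (3) works internally, so the following Python sketch is only one plausible reading: each virtual machine carries a label derived from its image (for example the guest OS), and a new virtual machine is placed on the server that already hosts the most virtual machines with the same label, so that their images are likely to share data blocks. The label, the capacity model, and the names `Server` and `place_vm` are all hypothetical.

```python
class Server:
    """A virtual desktop server with a fixed number of VM slots (assumed model)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.vms = []  # list of (vm_id, image_label) pairs

    def has_room(self):
        return len(self.vms) < self.capacity


def place_vm(vm_id, image_label, servers):
    """Place a VM next to VMs with the same image label; return the chosen server."""
    candidates = [s for s in servers if s.has_room()]
    if not candidates:
        raise RuntimeError("no server has a free slot")

    def score(s):
        same_label = sum(1 for _, label in s.vms if label == image_label)
        # Prefer more VMs with the same label, then the less loaded server.
        return (same_label, -len(s.vms))

    target = max(candidates, key=score)
    target.vms.append((vm_id, image_label))
    return target
```

Grouping similar images on one server is meant to raise the hit rate of that server's deduplicated block cache; the sketch ignores CPU and memory constraints that a real scheduler would have to consider.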
Through the above research on key data deduplication technologies in cloud environments, we provide strong technical support for future research on cloud storage and cloud computing.
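[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: TP309.3; TP333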
