Research on Deduplication Techniques for Virtual Machine Image Files

Published: 2018-11-05 14:08

[Abstract]: Virtual machine technology and virtual computing environments are among the most notable achievements in computer science in recent years. A virtual machine image file is the carrier for storing and transferring a virtual machine: it stores the machine's contents in a specific file format, which makes cloud computing far more convenient. However, as users create more and more "one-off" virtual machines, the number of image files on cloud platforms grows sharply, and the resulting redundant data poses a major challenge for cloud providers. Deduplicating virtual machine image files is therefore necessary. Existing work on deduplicating identical image files has shortcomings in how it chooses and partitions the deduplication granularity: it applies a hash at the file level and ignores the similarity between image files, so deduplication of similar image files remains unexplored. To address this, this thesis proposes, designs, and implements a multi-granularity hierarchical deduplication scheme based on SimHash. The scheme solves the problem of deduplicating similar virtual machine image files, improving storage utilization and saving network bandwidth. The main contributions are as follows:
(1) A detailed analysis of virtual machine image files, image file formats, and the similarity between image files. The analysis shows that the image file format is closely related to data redundancy and that image files of the same format share more than 60% similar data, which demonstrates the need to study deduplication of both identical and similar image files.
(2) Design and implementation of a hierarchical deduplication scheme for virtual machine image files based on the SimHash algorithm. The scheme splits an image file into data blocks using fixed-size chunking, computes SimHash values with an improved SimHash function, and uses them as unique identifiers. The SimHash ID is transmitted ahead of the data to reduce network transfer overhead, and files are compared for similarity to perform deduplication at two levels: the first level operates on files, the second on data blocks. A filter is introduced into the fingerprint lookup to reduce the number of on-disk index lookups.
(3) Experimental evaluation of the implemented scheme. The deduplication ratio, deduplication accuracy, feasibility, and stability were tested and compared with an existing deduplication scheme. The results show that the scheme is feasible, has some advantage in deduplication ratio and accuracy, and can save nearly 60% of storage space, but it falls short on stability, which needs further study.
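The two-level idea in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not describe the "improved" SimHash function, the block size, the similarity threshold, or the filter, so the code uses a textbook 64-bit SimHash, a 4096-byte block, and a Hamming-distance threshold of 8, all of which are assumptions. Because a plain SimHash value is not collision-resistant, the sketch also identifies blocks by SHA-256 instead of by SimHash, which departs from the abstract. The names `DedupStore`, `fixed_size_chunks`, `simhash`, and `hamming` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096    # assumed block size; the abstract does not state one
SIMHASH_BITS = 64    # assumed fingerprint width


def fixed_size_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive blocks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def simhash(features: list[bytes], bits: int = SIMHASH_BITS) -> int:
    """Textbook SimHash: every feature votes +1 or -1 on each bit position."""
    votes = [0] * bits
    for feature in features:
        h = int.from_bytes(hashlib.sha256(feature).digest()[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class DedupStore:
    """In-memory stand-in for a deduplicating image store (hypothetical)."""

    def __init__(self, threshold: int = 8):
        self.threshold = threshold  # assumed cutoff for "similar" files
        self.files: dict[str, tuple[int, list[str]]] = {}  # name -> (SimHash, block ids)
        self.blocks: dict[str, bytes] = {}                 # block id -> block data

    def add(self, name: str, data: bytes) -> tuple[str, int]:
        """Store an image; return (file-level verdict, number of blocks written)."""
        chunks = fixed_size_chunks(data)
        file_fp = simhash(chunks)
        # Assumption: SHA-256 block ids, so equal ids imply equal blocks.
        recipe = [hashlib.sha256(c).hexdigest() for c in chunks]

        # Level 1 (file): compare the file fingerprint with stored files.
        level = "distinct"
        for other_fp, other_recipe in self.files.values():
            distance = hamming(file_fp, other_fp)
            if distance == 0 and other_recipe == recipe:
                level = "duplicate"
                break
            if distance <= self.threshold:
                level = "similar"

        # Level 2 (block): write only blocks that are not already stored.
        stored = 0
        if level != "duplicate":
            for block_id, chunk in zip(recipe, chunks):
                if block_id not in self.blocks:
                    self.blocks[block_id] = chunk
                    stored += 1

        self.files[name] = (file_fp, recipe)
        return level, stored
```

Calling `add` twice with the same bytes under two names writes blocks only the first time; the second call is caught at the file level. An image that differs in a single block writes just that block. A client could send the file fingerprint and block ids first and upload only the missing blocks, which is one reading of the "pre-transmitted SimHash ID" in the abstract.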
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP333; TP302
