
Design and Implementation of a Deduplication File System Based on Cloud Storage

Posted: 2018-03-29 00:03

Topic: data deduplication. Focus: cloud storage. Source: master's thesis, Huazhong University of Science and Technology, 2013.


【Abstract】: With the growing demand for online storage, major cloud storage providers have begun exploring paid service models: better service now requires payment, free storage quotas no longer meet users' needs, and the cost of cloud storage has begun to affect users' work and daily lives. To address this problem, this thesis proposes a deduplication file system based on cloud storage.

The system is a client-side file system with incremental cloud synchronization that uses deduplication to upload the user's local data to the cloud without redundancy. It consists of six modules. The user interface module receives file system requests forwarded from the FUSE kernel module and invokes the other modules to serve them. The cloud synchronization module uses the cloud storage provider's open API and cooperates with the other modules to keep local and cloud data in sync. The file management module fetches the file list from the cloud, builds file index nodes, and organizes the files. The file operation module handles read and write requests. The deduplication module removes duplicate data at the source using a content-based variable-length chunking algorithm: a fixed-length sliding window computes a fingerprint over the file data, and whenever the fingerprint modulo a chosen divisor equals a predetermined value, the window position is taken as a chunk boundary; chunks with identical fingerprints are treated as duplicates. The deduplicated file and a metadata table recording the chunk information are then uploaded to the cloud. The garbage collection module reclaims unused tables and redundant data files when the file system is unmounted.

The system's deduplication compression ratio was evaluated on multi-version kernel files and virtual machine images. On large-scale document data, the deduplication ratio reached up to 67%. Priced at the Alibaba Cloud (Aliyun) tariff, 1 TB of user data could theoretically save 4,391 yuan per year.
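The content-defined chunking and source-side dedup steps described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the window length, polynomial base, divisor, magic value, and chunk-size bounds are assumed values chosen for the example, and SHA-1 stands in for whatever chunk fingerprint the system actually uses.

```python
import hashlib

# Illustrative parameters -- the thesis does not publish its actual values.
WINDOW = 48          # sliding-window length in bytes
BASE = 257           # polynomial base of the rolling fingerprint
MOD = (1 << 31) - 1  # fingerprint modulus
DIVISOR = 1024       # boundary when fingerprint % DIVISOR == MAGIC (~1 KiB average chunks)
MAGIC = 7
MIN_CHUNK = 256      # suppress boundaries that would produce tiny chunks
MAX_CHUNK = 8192     # force a boundary so a chunk cannot grow without bound
POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

def chunk(data: bytes) -> list:
    """Split data into variable-length chunks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:                       # slide: drop the outgoing byte
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + b) % MOD              # slide: absorb the incoming byte
        size = i + 1 - start
        if size >= MIN_CHUNK and (h % DIVISOR == MAGIC or size >= MAX_CHUNK):
            chunks.append(data[start:i + 1])  # cut a chunk at this boundary
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])           # trailing chunk
    return chunks

def dedup(data: bytes):
    """Source-side dedup: keep each unique chunk once, plus an ordered recipe."""
    store = {}   # fingerprint -> chunk bytes (stands in for the cloud block store)
    recipe = []  # ordered fingerprints (stands in for the metadata table)
    for c in chunk(data):
        fp = hashlib.sha1(c).hexdigest()
        store.setdefault(fp, c)               # upload only previously unseen chunks
        recipe.append(fp)
    return store, recipe
```

A file is rebuilt by concatenating `store[fp]` for each `fp` in `recipe`; upload cost is proportional to the unique chunks in `store`, so repeated content across file versions is transferred and stored only once.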
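As a sanity check on the cost claim, the implied storage tariff can be backed out from the abstract's own figures (67% of 1 TB saved, 4,391 yuan/year). The per-GB price below is derived from those two numbers, not a published Aliyun rate.

```python
# Derived from the abstract's figures; not an actual Aliyun price list.
saved_gb = 0.67 * 1024                  # deduplicated share of 1 TB, in GB
implied_price = 4391 / (saved_gb * 12)  # implied yuan per GB per month
print(round(implied_price, 2))          # roughly 0.53 yuan/GB/month
```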
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TP333
