Research on Performance Optimization of Deduplication Systems

Published: 2018-03-31 16:07

Topic: data deduplication; Focus: indexing mechanism; Source: Huazhong University of Science and Technology, 2013 master's thesis

【Abstract】: With the continued development of the Internet, the mobile Internet, and information technology, enterprises increasingly recognize the decisive role that data, the carrier of information, plays in their growth. In the big-data era, explosive data growth has drawn increasing attention to data deduplication from both academia and industry. The deduplication ratio is a key factor that any storage system with deduplication must consider. Because the Extreme Binning index relies on file-level similarity, it can fail to identify and eliminate a large amount of duplicate data when files have little similarity to one another. Data fragmentation is another pressing problem in deduplication systems: it degrades read performance and therefore leads to poor restore performance. To further improve the deduplication ratio, this thesis designs and implements a new index, Segment Index. Unlike Extreme Binning, Segment Index is based on similarity between segments rather than the conventional similarity between files, so it can better exploit similarity among data chunks and deliver a higher deduplication ratio with lower system overhead. To address the chunk fragmentation that deduplication introduces, the thesis designs and implements a rewriting strategy, CFL (Chunk Fragmentation Level), which computes the system's current degree of fragmentation and uses it to decide whether to rewrite certain duplicate chunks in order to improve read performance. Comprehensive tests show that Segment Index removes 93.02% to 99.91% of duplicate data, whereas Extreme Binning removes 85.15% to 97.46% under the same conditions. With the CFL strategy, read performance improves by about 58.7% compared with using no rewriting strategy.
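The abstract does not give the formula behind CFL, so the following is only a minimal sketch of how a fragmentation-driven rewrite decision could look. It assumes CFL is the ratio of the minimum number of containers a stream would need if stored contiguously to the number of containers it actually touches. The names `cfl` and `should_rewrite`, the container size, and the threshold value `0.6` are hypothetical and are not taken from the thesis.

```python
import math


def cfl(chunks, container_size):
    """Estimate a chunk fragmentation level for one data stream.

    chunks: list of (size_in_bytes, container_id) pairs in stream order.
    Assumed definition: optimal container count / actual container count,
    capped at 1.0. A value near 1.0 means little fragmentation.
    """
    total_bytes = sum(size for size, _ in chunks)
    if total_bytes == 0:
        return 1.0
    # Containers needed if the stream were laid out contiguously.
    optimal = math.ceil(total_bytes / container_size)
    # Distinct containers that must actually be read to restore the stream.
    actual = len({container_id for _, container_id in chunks})
    return min(1.0, optimal / actual)


def should_rewrite(chunks, container_size, threshold=0.6):
    """Hypothetical policy: rewrite duplicates once CFL drops below threshold."""
    return cfl(chunks, container_size) < threshold


if __name__ == "__main__":
    container_size = 4 * 1024 * 1024  # assumed 4 MiB containers
    sequential = [(4096, 0)] * 1024   # 4 MiB stored in a single container
    scattered = [(4096, i % 4) for i in range(1024)]  # same data over 4 containers
    print(cfl(sequential, container_size), should_rewrite(sequential, container_size))
    print(cfl(scattered, container_size), should_rewrite(scattered, container_size))
```

In this sketch a stream whose chunks are spread over four containers, although one would suffice, scores 0.25 and triggers rewriting, while the contiguous stream scores 1.0 and is left alone. A real system would also weigh the deduplication ratio lost by writing duplicate chunks again.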
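【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2013
【Classification number】: TP333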
