
Research and Application of Duplicate Data Detection Technology Based on the Hadoop Distributed System

Published: 2018-05-29 10:26

Topic: cloud computing + Hadoop; Source: master's thesis, Hunan University, 2013


【Abstract】: With the rapid development of information technology, cloud computing and data deduplication have both advanced quickly. Cloud computing dominates massive-data processing thanks to its powerful distributed computing capability, low cost, and high reliability; however, when data is archived in a Hadoop system, large amounts of duplicate data accumulate and degrade processing efficiency. Data deduplication is a popular storage technology that optimizes storage capacity and greatly reduces wasted physical storage space, satisfying ever-growing data storage demands. Combining cloud computing with deduplication is therefore a win-win solution.

To address these problems, this thesis analyzes the characteristics of the Hadoop cloud computing platform and of deduplication technology, and uses the Hadoop distributed platform to manage massive data. For the large volume of duplicate data in the Hadoop system, it proposes a duplicate-detection scheme based on deduplication: the BLAKE fingerprint algorithm generates fingerprints for data blocks, deletion operates at block-level granularity, and duplicates are removed in-line (on the write path, rather than in a later batch pass).

SHA-3 hash algorithms are recognized by industry for their advantages in data processing. This thesis is the first to adopt the SHA-3 candidate algorithm BLAKE as the fingerprint function for duplicate detection, replacing the original MD5 fingerprint algorithm for fingerprint generation and matching. The algorithm is given its own detailed software design and implementation, and its experimental performance improves substantially on the traditional MD5 fingerprint algorithm.

Finally, the research is applied to the Internet of Vehicles, using Hadoop to store and manage large-scale vehicle data. Based on the characteristics of the HBase data model, a distributed storage model for traffic data is designed, including detailed designs for the main table and the reverse table, which support users' conditional queries to a certain extent. Deduplication is then applied to the duplicate data produced when vehicle data is archived; experiments on three vehicle-terminal data sets, with detailed performance analysis, show greatly reduced hard-disk consumption, improved storage efficiency, and elimination of storage redundancy.
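The thesis text itself is not reproduced on this page, so the following is a minimal illustrative sketch, not the author's implementation. It shows the deduplication pipeline the abstract describes: fixed-size block chunking, fingerprint generation, and in-line elimination of blocks whose fingerprint is already indexed. Python's hashlib.blake2b (the standardized successor to the SHA-3 candidate BLAKE, which is not itself in the standard library) stands in for BLAKE; the 4 KB block size and all identifiers are assumptions.

```python
import hashlib
import io

BLOCK_SIZE = 4096  # illustrative fixed block size; the thesis deduplicates at block granularity


def fingerprint(block: bytes) -> bytes:
    # BLAKE2b stands in for the SHA-3 candidate BLAKE used in the thesis;
    # the candidate algorithm itself is not available in hashlib.
    return hashlib.blake2b(block, digest_size=32).digest()


def inline_dedup(stream, index):
    """In-line deduplication: fingerprint each block as it arrives and keep
    only blocks whose fingerprint is not already in the index.

    Returns (unique blocks kept, number of duplicate blocks skipped)."""
    kept, duplicates = [], 0
    while block := stream.read(BLOCK_SIZE):
        fp = fingerprint(block)
        if fp in index:
            duplicates += 1      # duplicate block: record a reference only
            index[fp] += 1
        else:
            index[fp] = 1        # new block: store it and index its fingerprint
            kept.append(block)
    return kept, duplicates


# Example: four blocks arrive; the second and fourth are identical to the first.
index = {}
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
kept, dups = inline_dedup(io.BytesIO(data), index)
print(len(kept), dups)  # -> 2 2 (two unique blocks stored, two duplicates skipped)
```

Replacing hashlib.blake2b with hashlib.md5 in fingerprint() gives the MD5 baseline the thesis compares against; BLAKE2 is generally faster than MD5 in software on 64-bit platforms while offering far stronger collision resistance, which is consistent with the performance improvement the abstract reports.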
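The abstract mentions a main table and a reverse table in the HBase storage model but does not give their schema. The sketch below is a hypothetical row-key design built on the HBase property that rows sort lexicographically by key: the main table orders records by vehicle then time, and the reverse table by time then vehicle, so both per-vehicle history queries and time-window queries become prefix scans. Field names, separators, and values are assumptions, not the thesis's actual design.

```python
from dataclasses import dataclass


@dataclass
class TrafficRecord:
    vehicle_id: str   # vehicle/terminal identifier (hypothetical field)
    timestamp: int    # epoch seconds
    payload: str      # e.g. GPS position and speed


def main_table_key(rec: TrafficRecord) -> bytes:
    # Main table: keys sort by vehicle, then time, so a prefix scan on a
    # vehicle_id returns that vehicle's full history in time order.
    return f"{rec.vehicle_id}#{rec.timestamp:010d}".encode()


def reverse_table_key(rec: TrafficRecord) -> bytes:
    # Reverse (index) table: keys sort by time, then vehicle, so a scan over
    # a zero-padded timestamp range answers "all vehicles in this window".
    return f"{rec.timestamp:010d}#{rec.vehicle_id}".encode()


rec = TrafficRecord("HN-A12345", 1356998400, "lat=28.19,lon=112.98,speed=62")
print(main_table_key(rec))     # b'HN-A12345#1356998400'
print(reverse_table_key(rec))  # b'1356998400#HN-A12345'
```

Zero-padding the timestamp matters because HBase compares row keys as raw bytes; without it, lexicographic order would not match chronological order.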
【Degree-granting institution】: Hunan University
【Degree level】: Master's
【Year conferred】: 2013
【CLC classification】: TP333


Article ID: 1950538



Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1950538.html

