
Research and Application of Distributed Storage Based on HDFS

【Abstract】: The development of information technology has made large-scale data storage increasingly common. Data must be kept for long periods while its volume keeps growing, and traditional file systems cannot meet the capacity, speed, and safety requirements of storing and processing such large amounts of data. Distributed file systems can hold massive data sets and are therefore a key technology for large-scale data storage. In recent years Hadoop, as a solution for storing and processing large-scale data, has been widely adopted by major companies at home and abroad. The Hadoop Distributed File System (HDFS), one of the two core components of Hadoop, can serve as a large-scale data storage solution.

This thesis studies distributed storage based on HDFS, covering small-file handling in an HDFS cluster, the replica placement policy and rack awareness, the NameNode backup and recovery mechanism, and the cluster expansion mechanism. Three schemes for handling small files in an HDFS cluster are examined: Hadoop Archive, SequenceFile, and CombineFileInputFormat. The replica placement policy and rack awareness let the NameNode build a network topology of the DataNodes and choose replica locations according to the relationships between them, ensuring data reliability while also taking transfer efficiency into account. The NameNode backup and recovery mechanism protects NameNode metadata by periodically backing up and merging the metadata into a new checkpoint; if the NameNode crashes, this shortens the NameNode restart time and can even recover lost data. The scalability of HDFS lies in dynamically adding DataNodes, which meets the demands of large-scale data growth.

Finally, applications are built on an HDFS cluster, and its file transfer efficiency is compared with FTP, demonstrating the feasibility of HDFS as a solution for large-scale data storage.
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP333
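The abstract above only names the three small-file schemes; as a rough, non-authoritative illustration of one of them (the SequenceFile scheme), the Java sketch below packs a directory of local small files into a single SequenceFile on HDFS, with the original file name as the key and the raw bytes as the value. The HDFS output path /data/packed/small-files.seq, the local small-files directory, and the SmallFilePacker class name are all hypothetical, and the sketch assumes fs.defaultFS in core-site.xml already points at the cluster's NameNode.

// Minimal sketch (not from the thesis): pack many local small files into one
// SequenceFile on HDFS, so the NameNode tracks a single large file instead of
// thousands of tiny ones. All paths below are hypothetical.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS (core-site.xml) is assumed to point at the HDFS NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path packed = new Path("/data/packed/small-files.seq"); // hypothetical HDFS path
        File localDir = new File("small-files");                // hypothetical local directory

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, packed, Text.class, BytesWritable.class);
        try {
            File[] files = localDir.listFiles();
            if (files != null) {
                for (File f : files) {
                    if (!f.isFile()) {
                        continue;
                    }
                    byte[] content = Files.readAllBytes(f.toPath());
                    // key = original file name, value = raw file contents
                    writer.append(new Text(f.getName()), new BytesWritable(content));
                }
            }
        } finally {
            writer.close();
            fs.close();
        }
    }
}

Because the small files become records inside one large HDFS file, the NameNode only has to keep metadata for that single file, which is why the thesis lists SequenceFile, alongside Hadoop Archive and CombineFileInputFormat, as a remedy for the small-file problem.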


Link to this article: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2432056.html

