Analysis and Design of a Hadoop-Based Massive File Storage System

Published: 2018-04-11 05:18

Topic: Hadoop + massive files; Reference: master's thesis, Beijing University of Technology, 2015

[Abstract]: The Internet continues to surge forward, and the trends toward informatization, intelligence, data orientation, and massive scale are increasingly evident. Portals and e-commerce sites keep growing larger and more conglomerate. To provide wide-ranging services, Internet giants such as Tencent, Taobao, Baidu, and Sina now store data at massive scale, and the volume continues to grow explosively. The cost of vertically scaling mass storage keeps rising, placing an ever heavier burden on enterprises that rely on commercial storage and even becoming a bottleneck that constrains their development. A massive file storage system with high capacity and support for high concurrency is therefore urgently needed. Based on an analysis of actual requirements, this thesis builds a distributed storage system architecture on Hadoop. The model is supported by the underlying file storage of Hadoop's HDFS distributed file system and runs on inexpensive Linux cluster hardware. It relies on the high responsiveness, high fault tolerance, high-concurrency support, and intra-cluster data balancing that HDFS provides to build our own massive file storage and to offer a highly reliable service externally. Hadoop's HDFS distributed file system and MapReduce parallel programming framework provide strong technical support for designing a large-scale data storage architecture, ultimately enabling efficient file access in a high-concurrency, high-load environment. Cache design and load-balancing design improve the system's ability to handle high concurrency and optimize file reads and writes. Massive file storage inevitably brings large-scale file metadata storage, so the HBase distributed column-oriented database is used to store file metadata and meet the requirements for high capacity and efficiency. The HBase row key is designed around factors such as file type and the application a file belongs to, so that files are stored, as far as possible, on cluster nodes that are physically close to one another. This reduces disk seeks and cross-node and cross-network addressing, and improves file access efficiency. A Hadoop cluster is built, the application servers are deployed, and high-concurrency stress experiments are run; the experimental data are collected and analyzed to verify whether the system architecture achieves its intended goals. The thesis focuses on the challenges posed by high concurrency and large capacity: the system can scale capacity horizontally, lower storage cost, and provide efficient service. It uses relatively mature distributed technologies to implement file storage and processing, building a Hadoop cluster and deploying application servers, file servers, cache servers, and so on. The practical effectiveness of the model is analyzed from test data to check whether the proposed architecture can support massive file storage and management.
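The row-key idea mentioned in the abstract can be illustrated with a minimal sketch. The thesis's actual key layout is not given here, so everything below is an assumption for illustration: the field order (application, then file type, then a short hash of the file name), the `|` separator, and the helper name `make_row_key` are all hypothetical. The sketch relies only on the fact that HBase keeps rows sorted lexicographically by row key, so keys sharing a prefix sit next to each other.

```python
import hashlib


def make_row_key(app: str, file_type: str, file_name: str) -> bytes:
    """Build a hypothetical composite key: <app>|<file_type>|<8-hex name hash>."""
    for part in (app, file_type):
        if "|" in part:
            raise ValueError("separator '|' is not allowed in key fields")
    # A short hash keeps the key fixed-width after the prefix; the full file
    # name would live in a column, not in the key (an assumption of this sketch).
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()[:8]
    return f"{app}|{file_type}|{digest}".encode("utf-8")


# Python's byte-string sort mimics HBase's lexicographic row ordering:
# files of the same application and type end up adjacent.
keys = sorted(
    make_row_key(app, ftype, name)
    for app, ftype, name in [
        ("shop", "jpg", "a.jpg"),
        ("mail", "pdf", "report.pdf"),
        ("shop", "jpg", "b.jpg"),
        ("shop", "pdf", "invoice.pdf"),
    ]
)
for key in keys:
    print(key.decode("utf-8"))
```

Putting the application and file type first groups related metadata rows into a contiguous key range, which is the kind of locality the abstract describes; the trade-off is that a very busy application can concentrate load on one key range.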
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP333

[Co-cited references]