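Design and Implementation of a Distributed Storage System for Massive Small Files
Published: 2018-04-10 14:18
Topics: massive small files + file system; Source: Hunan University, master's thesis, 2013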
[Abstract]: In recent years, the growth of the Internet has made the transmission and storage of massive amounts of information increasingly common, and data storage technology has developed rapidly in response. Because most information on the Internet takes the form of massive numbers of small files, distributed file systems, an important research direction in small-file storage, have become an active research topic. At present, storing massive numbers of small files in a distributed file system still commonly suffers from poor storage performance, low storage-space utilization, performance bottlenecks, and single points of failure. Solving these practical problems in the storage and transmission of massive small-file data is therefore an important task in current computer storage research.

First, to address these problems, this thesis proposes a data blocking scheme for storing massive numbers of small files on a single data node. The scheme describes the concept of a small file and the associated algorithm, and defines three metrics for a file block: intra-block utilization, intra-block correlation, and inter-block correlation. These metrics make it possible to quantify how small files are distributed within each file block, to assess the block's effect on data queries, and then to optimize in a targeted way.

Second, the thesis proposes an algorithm for determining the number of data replicas in small-file storage. The algorithm takes as its parameter the reliability of the data nodes that hold a small file's replicas. From this parameter the reliability of the small file can be determined quickly, and the system uses that reliability to decide whether the current number of replicas meets the requirement. On this basis, a flexible weak-consistency maintenance scheme for small-file replicas is proposed.

Third, based on an analysis of the functional and performance requirements of a distributed storage system for massive small files, the thesis presents the framework of the complete small-file storage and management system. The distributed storage of massive small files is designed and implemented in four main parts: the data node (DataNode); the data management server (DataServer); management of the file-block inverted table, the file inverted table, and directories; and the corresponding API functions.

Finally, the system was tested to evaluate its overall performance. Analysis and testing of several key metrics lead to the conclusion that the system's performance basically meets the design requirements and can satisfy the needs of a real environment.
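The replica-count idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formula, so this sketch assumes that data nodes fail independently, in which case a file with replicas on nodes of reliability r_1, ..., r_n survives with probability 1 - (1 - r_1)(1 - r_2)...(1 - r_n). The function names, the target threshold, and the `max_replicas` cap are all hypothetical and are not taken from the thesis.

```python
from math import prod


def file_reliability(node_reliabilities):
    """Probability that at least one replica survives.

    Assumption (not from the thesis): node failures are independent,
    and each value is a node's reliability in [0, 1].
    """
    return 1.0 - prod(1.0 - r for r in node_reliabilities)


def replicas_sufficient(node_reliabilities, target):
    """True if the current replica placement meets the target reliability."""
    return file_reliability(node_reliabilities) >= target


def choose_replica_nodes(candidates, target, max_replicas=5):
    """Greedily add the most reliable nodes until the target is met.

    candidates: dict mapping a hypothetical node name to its reliability.
    Returns the chosen node names, or None if the target cannot be
    reached within max_replicas nodes.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    chosen = []
    for name, _ in ranked[:max_replicas]:
        chosen.append(name)
        if replicas_sufficient([candidates[n] for n in chosen], target):
            return chosen
    return None
```

Under these assumptions, a file placed on highly reliable nodes needs fewer replicas than one placed on less reliable nodes, which is how a system could decide whether the current replica count is enough for a given small file.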
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP333
Article ID: 1731554
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1731554.html