Research and Implementation of a Distributed Storage System for Agricultural Scientific Data

Published: 2018-04-26 10:04

Topic: agricultural scientific data + distributed storage; Source: master's thesis, Beijing University of Technology, 2015

[Abstract]: Storing agricultural scientific data is an important part of agricultural research. Existing agricultural storage systems fall well short in performance, storage capacity, data reliability, and storage cost. To address the difficulty of storing PB-scale, unstructured, and highly varied agricultural scientific data, this thesis analyzes agricultural scientific data files in depth, studies distributed storage technology, and proposes a distributed storage solution based on the open-source cloud platform Hadoop. The main results are as follows:
1) Based on the characteristics and application requirements of agricultural scientific data, the thesis designs a framework model of a distributed storage system for agricultural science big data. The model stores unstructured file data in an improved HDFS architecture and stores heterogeneous, structured attribute data in the HBase database system. It gives a design that preserves the association between data files and their attributes, and it places caches on the Client side and on the data nodes to improve file access efficiency.
2) Because agricultural scientific data contains massive numbers of small files, the thesis proposes a multi-attribute merge-and-store strategy for them. Small files are classified by specific attributes, and files in the same class are merged into one large aggregate file, which effectively reduces the memory that massive numbers of small files consume on the central node and improves file access efficiency. An index from each small file to its aggregate file is created and cached to improve read performance.
3) To handle the hot-data problem caused by the strong seasonality of agricultural scientific data files, the thesis proposes a dynamic replica management strategy with two parts. The first is a method for dynamically adding and deleting replicas based on file access frequency: it counts accesses to a file within a fixed period, computes the file's heat, and, taking into account factors such as the statistics period and file caching, dynamically adjusts the number of replicas. The second is a dynamic replica placement method based on node state: it considers multiple parameters that describe the state of each data node, computes each node's performance, and selects the best node for placement, in order to improve system performance and file read efficiency.
Based on these results, the thesis designs and implements AGRFS, a distributed storage system for agricultural science big data. AGRFS implements the basic functional modules and user access interfaces. A Hadoop cluster was built, and experiments on it were used to verify the feasibility of the strategies above and the usability of the system. The results show that the proposed small-file storage strategy and dynamic replica management strategy improve the efficiency of small-file read and write operations and optimize system performance, and that the distributed storage system designed in the thesis handles the storage of agricultural scientific data well.
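The small-file merging idea in the abstract can be illustrated with a minimal sketch. This is not the AGRFS implementation: the abstract does not say which attributes are used for classification or how the index is laid out, so the grouping attributes, the `(aggregate key, offset, length)` index entry, and the function names below are all assumptions, and in-memory dictionaries stand in for HDFS files and the cached index.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical representation of a small file: (name, attributes, payload).
SmallFile = Tuple[str, Dict[str, str], bytes]
# Hypothetical index entry: file name -> (aggregate key, byte offset, length).
Index = Dict[str, Tuple[str, int, int]]


def merge_small_files(
    files: List[SmallFile], group_by: List[str]
) -> Tuple[Dict[str, bytes], Index]:
    """Group small files by the chosen attributes and concatenate each group.

    Returns the aggregate files (keyed by class) and an index that records
    where each small file sits inside its aggregate.
    """
    buffers: Dict[str, bytearray] = defaultdict(bytearray)
    index: Index = {}
    for name, attrs, payload in files:
        # Files with the same attribute values fall into the same class.
        key = "/".join(attrs.get(a, "unknown") for a in group_by)
        buf = buffers[key]
        # Record the offset before appending so the file can be sliced out later.
        index[name] = (key, len(buf), len(payload))
        buf.extend(payload)
    aggregates = {key: bytes(buf) for key, buf in buffers.items()}
    return aggregates, index


def read_small_file(name: str, aggregates: Dict[str, bytes], index: Index) -> bytes:
    """Look up a small file in the index and slice it out of its aggregate."""
    key, offset, length = index[name]
    return aggregates[key][offset:offset + length]
```

With this layout the central node would only need to track one entry per aggregate file rather than one per small file, and a read costs one index lookup plus one ranged read, which is the effect the abstract attributes to the strategy.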
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP333

[Similar literature]