Research and Optimization of Massive Small File Storage for Vehicle and Driver Management Services

Published: 2018-03-06 21:24

Topic: massive small files　Focus: HDFS　Source: Guangxi Normal University, 2017 master's thesis　Type: degree thesis

[Abstract]: With the growth of the Internet and the arrival of the information age, application systems in astronomy, geography, meteorology, e-commerce and many other fields have accumulated enormous volumes of data, much of it stored as large numbers of very small files. Public-facing service sectors such as banks, postal services and vehicle management offices have also begun to adopt the "Internet Plus" development model, and in meeting their own needs they have gradually produced massive collections of small files numbering in the hundreds of millions and beyond. The number of these files is still growing explosively, which poses major challenges for storage efficiency, retrieval and metadata management. Against the background of the big data era, and in line with the Guiding Opinions on the Construction of the Internet Traffic Safety Comprehensive Service Platform (Gong Jiao Guan (2013) No. 433), this thesis addresses the needs of the vehicle and driver management business system of the Nanning vehicle management office in order to advance an "Internet Plus vehicle management office" big data platform. It builds VDSMSS (Vehicle-Driving Service Mass Storage System), a massive small file storage system for vehicle and driver management services based on the Hadoop Distributed File System (HDFS). The system lays a foundation for the "Internet Plus vehicle management office" big data platform, and it also offers a workable approach and optimization direction for designing HDFS-based massive small file storage systems in service industries, which gives the work practical significance and value. The main research content is as follows:
(1) The thesis outlines the core architecture of HDFS and its key internal data structures. It reviews the storage optimization schemes currently used in industry for massive small files and analyzes the strengths and weaknesses of several representative schemes. It also introduces several representative cache replacement algorithms, with emphasis on the self-adjusting cache replacement algorithm of the ZFS file system (ZFS Adjustable Replacement Cache, ZFS-ARC).
(2) The thesis analyzes the problems HDFS encounters when storing massive numbers of small files and identifies the direction for optimization. It summarizes the characteristics of small files in the vehicle and driver management business system and, based on those characteristics, designs a scheme that merges multiple small files into one large file per user, grouped by time and business type. This reduces the number of small files and therefore the NameNode memory consumed by small-file metadata. It also designs an efficient single-file lookup method and a batch lookup index, so that files can be retrieved in batches under given query conditions without sacrificing retrieval speed.
(3) Because HDFS provides no prefetching or caching for file reads and writes, the thesis proposes a self-adjusting cache replacement algorithm based on a prefetch mechanism driven by file association degree. First, a traditional association rule mining algorithm is applied to the Hadoop log files that record small-file accesses, and the mined data are analyzed mathematically to compute the latent association degree between small files. Second, a file prefetch mechanism is designed around this association degree: when a small file is read, its associated files are prefetched into the cache. Then ZFS-ARC, a cache replacement algorithm that accounts for both recency ("time") and frequency, is combined with this prefetch mechanism to produce PRE-ZFSARC, a self-adjusting cache replacement algorithm based on file-association prefetching, which is used to improve small-file read performance in VDSMSS. Finally, comparative experiments demonstrate the effectiveness of the proposed scheme. The result is a performance-optimized massive small file storage system that is well suited to the vehicle and driver management business system of the vehicle management office.
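The merging and indexing idea in point (2) of the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it keeps everything in memory rather than on HDFS, and the class names, field names and business labels (`MergedUserFile`, `IndexEntry`, `license_renewal`, and so on) are hypothetical. It only shows the general technique of packing one user's small files into a single blob, recording offset and length for single-file lookup, and filtering the same index by business type and date range for batch lookup.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class IndexEntry:
    """Where one small file lives inside the merged blob, plus query attributes."""
    name: str
    business: str   # hypothetical business-type label
    created: date
    offset: int     # byte offset inside the merged blob
    length: int     # size of the small file in bytes


class MergedUserFile:
    """All small files of one user packed into a single blob (assumed layout)."""

    def __init__(self, user_id: str) -> None:
        self.user_id = user_id
        self.blob = bytearray()                  # stands in for one large HDFS file
        self.index: Dict[str, IndexEntry] = {}   # file name -> location

    def append(self, name: str, business: str, created: date, data: bytes) -> None:
        """Append a small file to the blob and record its position in the index."""
        self.index[name] = IndexEntry(name, business, created, len(self.blob), len(data))
        self.blob.extend(data)

    def read(self, name: str) -> bytes:
        """Single-file lookup: one index hit, then a slice of the blob."""
        entry = self.index[name]
        return bytes(self.blob[entry.offset:entry.offset + entry.length])

    def query(self, business: str, start: date, end: date) -> List[str]:
        """Batch lookup: names of files of one business type in a date range."""
        return [
            e.name
            for e in self.index.values()
            if e.business == business and start <= e.created <= end
        ]


merged = MergedUserFile("user-001")
merged.append("a.jpg", "license_renewal", date(2017, 1, 5), b"AAA")
merged.append("b.jpg", "vehicle_registration", date(2017, 2, 1), b"BBBB")
merged.append("c.jpg", "license_renewal", date(2017, 3, 1), b"CC")

january_renewals = merged.query("license_renewal", date(2017, 1, 1), date(2017, 1, 31))
```

With this layout the namespace holds one entry per user instead of one per small file, which is the effect on NameNode memory that the abstract describes. A real system would also need to persist the index and handle deletions, which this sketch leaves out.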
[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP333
