Analysis and Optimization of Data Read/Write Flows in a Distributed File System

Published: 2018-08-14 14:54

[Abstract]: In the era of big data, storage systems play an increasingly important role in many practical applications, and their read/write performance directly affects the performance of the applications built on top of them. Today's distributed file systems generally rely on scaling out to meet ever-growing performance demands, but larger deployments tend to raise costs and make maintenance harder. Object-based file systems exploit the intelligence of storage devices, yet they overlook the fact that all components of a storage system form an organic whole. The key to storage system performance is whether the strengths of each node, and the interconnect between nodes, are fully exploited. This thesis studies the data read and write flows in a storage system and optimizes the key steps that limit system performance. All of the work is implemented and tested in Cappella, an object-based distributed file system developed in the laboratory. For the write flow, a dynamic layout scheme driven by the real-time load of the storage servers is designed and implemented. Each storage server carries a real-time weight that reflects how busy or idle it is; when a file is laid out, servers are chosen by weighted random selection over the real-time load of all storage servers, which resolves the load imbalance that Cappella's static layout tends to cause. For the read flow, the existing data prefetching (readahead) algorithm in the Linux kernel is analyzed in detail, and a prefetching strategy suited to distributed environments is designed and implemented to address its shortcomings. The Linux prefetching algorithm was designed around the constraints of a local file system with disks as storage devices, and it falls short in a distributed environment. There, data is spread across multiple nodes connected by a dedicated high-speed network, so the interconnect and the way data is distributed across nodes become the key to optimizing system performance. The distributed prefetching algorithm takes both the limits of network transfer and the characteristics of the data distribution into account, and it effectively improves system performance. Test results show that data is distributed sensibly across the storage servers according to their weights, and that read bandwidth improves by more than 30% for sequential access and large-block random access, with a maximum gain of nearly 90%.
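The weighted random selection described for the write flow can be illustrated with a minimal Python sketch. This is not Cappella's implementation: the function name `pick_servers`, the server names, and the convention that a larger weight means a less busy server are all assumptions made for illustration, since the abstract does not specify how weights are computed or scaled.

```python
import random


def pick_servers(weights, count, rng=random):
    """Pick `count` distinct servers, favoring lightly loaded ones.

    `weights` maps a server name to a non-negative real-time weight.
    Assumption (not from the thesis): larger weight = less busy.
    """
    # Servers with zero weight are treated as fully busy and skipped.
    candidates = {name: w for name, w in weights.items() if w > 0}
    if count > len(candidates):
        raise ValueError("not enough servers with positive weight")

    chosen = []
    for _ in range(count):
        names = list(candidates)
        # Biased random draw: probability proportional to the weight.
        pick = rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]
        chosen.append(pick)
        del candidates[pick]  # draw without replacement
    return chosen
```

For example, `pick_servers({"oss1": 8, "oss2": 1, "oss3": 1}, 2)` returns two distinct servers, and over many calls `oss1` is chosen first far more often than the other two. Because the draw stays random rather than always taking the least loaded server, concurrent clients do not all pile onto the same node between weight updates.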
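[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333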
