Research on a Node-Status-Based Replica Distribution Strategy for Distributed File System Storage

Published: 2018-03-05 18:05

Topic: HDFS file system; Focus: node selection algorithm; Source: Jilin University, 2013 master's thesis; Document type: degree thesis

[Abstract]: Information is expanding rapidly: people have moved from seeking out information to retrieving and filtering it, which reflects how large the volume of information has become. Enterprises and production environments generate huge amounts of data every day, and they still have to store it for later data mining, because the analysis mined from that data ultimately produces value in production and marketing; this is the value of big data. With cloud computing now prevalent, its distinctive service delivery model produces large amounts of big data and user data in the cloud, which makes storing that data reliably and securely a major challenge. This thesis starts from the broader cloud computing context, studies the mainstream distributed storage platforms, and proposes a node-status-based replica distribution strategy for distributed storage (Node Status based Replication Distribution, the NSRD strategy). Starting from node status, the strategy analyzes each node's CPU utilization, disk throughput utilization, memory utilization, network bandwidth utilization, and disk capacity utilization, describes a mechanism for scoring each node (the KPI), and uses this KPI as the basis for recommending suitable nodes to clients that write to the file system. To present the strategy more clearly, the thesis abstracts it into a model and divides it into three services: a node status acquisition service, a status information forwarding service, and a target node selection service.
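The scoring mechanism described above can be sketched as a weighted sum over the five utilization metrics. The Python sketch below is an illustration under stated assumptions, not the formula used in the thesis: the abstract gives neither the KPI formula nor its weights, so the names `NodeStatus` and `kpi_score`, the equal weights, and the choice to score idle capacity (1 minus utilization) are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NodeStatus:
    """Utilization of one storage node, each value in the range 0.0 to 1.0."""
    cpu: float
    disk_throughput: float
    memory: float
    network_bandwidth: float
    disk_capacity: float


# Hypothetical equal weights; the thesis determines its weights experimentally
# and the abstract does not list them.
DEFAULT_WEIGHTS = {f.name: 0.2 for f in fields(NodeStatus)}


def kpi_score(status: NodeStatus, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Return a weighted sum of idle capacity; higher means a less loaded node."""
    return sum(w * (1.0 - getattr(status, name)) for name, w in weights.items())


# Made-up readings for two nodes, used only to exercise the function.
idle_node = NodeStatus(0.10, 0.20, 0.30, 0.10, 0.40)
busy_node = NodeStatus(0.90, 0.80, 0.70, 0.95, 0.85)
print(kpi_score(idle_node) > kpi_score(busy_node))
```

With this assumed form, a node that is idle on every metric scores close to 1 and a fully loaded node scores close to 0, so ranking nodes by score puts lightly loaded nodes first.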
To explain the three services that make up the strategy, the thesis works with the HDFS file system and, based on how HDFS operates, argues that a node-status-based replica distribution strategy is necessary. The analysis in Chapters 3 and 4 shows that today's popular distributed file systems all store large files by splitting them into chunks, and that during chunked storage the control node of each file system must recommend target nodes to the client. When recommending storage nodes in the cluster, however, the control node often uses a Round-Robin random selection strategy. This strategy is simple and easy to implement, but because it does not fully account for the CPU utilization, memory utilization, disk throughput utilization, network bandwidth utilization, and disk space utilization of the nodes in the cluster, the chosen target node may be overloaded or short of disk capacity.
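To make the drawback concrete, here is a minimal Python sketch of a load-blind round-robin selector. It is not HDFS code; the class name `RoundRobinSelector`, the node names, and the CPU figures are made up for illustration.

```python
from itertools import cycle


class RoundRobinSelector:
    """Hands out storage nodes in a fixed rotation, ignoring their load."""

    def __init__(self, node_names):
        self._rotation = cycle(node_names)

    def next_target(self):
        return next(self._rotation)


# Hypothetical CPU utilization per node; the selector never consults it.
cpu_utilization = {"node-a": 0.95, "node-b": 0.20, "node-c": 0.30}

selector = RoundRobinSelector(list(cpu_utilization))
targets = [selector.next_target() for _ in range(4)]
print(targets)
```

Over four requests the rotation returns to `node-a`, the busiest node in this made-up cluster, because the selection never looks at node status.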
To address these problems, the strategy uses the node status acquisition service to let each storage node obtain its own status accurately and in real time, and the status information forwarding service to forward that status to the control node of the cluster. The control node then scores every node through the target node selection service and returns the information of the node with the highest KPI value to the client.
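The flow through the three services can be sketched in a few lines of Python. This is a simplified single-process illustration, not the HDFS integration described in the thesis: the classes `StorageNode` and `ControlNode`, the placeholder score `mean_idle`, and the sample readings are all assumptions.

```python
from typing import Callable, Dict

Status = Dict[str, float]  # metric name -> utilization in the range 0.0 to 1.0


def mean_idle(status: Status) -> float:
    """Placeholder KPI: average idle fraction across the reported metrics."""
    return sum(1.0 - v for v in status.values()) / len(status)


class ControlNode:
    """Keeps the latest status per storage node and picks write targets."""

    def __init__(self, score: Callable[[Status], float] = mean_idle):
        self._score = score
        self._latest: Dict[str, Status] = {}

    def receive(self, node_name: str, status: Status) -> None:
        # Receiving end of the forwarding service: keep only the newest report.
        self._latest[node_name] = status

    def select_target(self) -> str:
        # Target node selection service: return the node with the highest KPI.
        return max(self._latest, key=lambda name: self._score(self._latest[name]))


class StorageNode:
    """Storage node that reports its own status to the control node."""

    def __init__(self, name: str, control: ControlNode):
        self.name = name
        self._control = control

    def report(self, status: Status) -> None:
        # Acquisition: a real node would measure these values itself; here
        # they are passed in. Forwarding: hand the status to the control node.
        self._control.receive(self.name, status)


control = ControlNode()
node_a = StorageNode("node-a", control)
node_b = StorageNode("node-b", control)
node_a.report({"cpu": 0.9, "memory": 0.8, "disk_capacity": 0.7})
node_b.report({"cpu": 0.2, "memory": 0.3, "disk_capacity": 0.4})
chosen = control.select_target()
print(chosen)
```

Passing the scoring function into `ControlNode` keeps the selection logic independent of any particular KPI formula, so a differently weighted score can be swapped in without touching the reporting path.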
To demonstrate that the NSRD strategy can be implemented, the thesis modifies the replica distribution strategy of the HDFS file system, integrates the three NSRD services (node status acquisition, status information forwarding, and target node selection) into HDFS, and runs experiments under different scenarios. Because a large-scale cluster cannot be reproduced in a laboratory environment, the thesis uses MATLAB to simulate the NSRD strategy and compare it with the default HDFS strategy, analyzing transfer efficiency and transfer stability.
Because distribution mechanisms for distributed file systems are still at the research stage and many distributed file systems have no intelligent distribution mechanism built in, the thesis offers, as a starting point for further work, a method that uses node status to decide the final storage target node. The weights in the algorithm that estimates each node's KPI value were obtained and fixed through a single kind of experiment, so future work should use a variety of experiments to make the weights more accurate. Once a complete data set of data node scores is available, machine learning and prediction methods could also be added to make the node selection strategy more comprehensive.

[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333;TP316.4

[Similar literature]