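Research and Optimization of Alluxio-Based High-Availability Data Management Technology
Topic: Alluxio + data management; Source: master's thesis, Harbin Institute of Technology, 2017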
[Abstract]: As storage hardware costs keep falling, the big data ecosystem grows more complex, and computing frameworks and storage systems become more diverse and heterogeneous, a range of products such as memory-based distributed file systems and databases has emerged to integrate the big data ecosystem and better serve external applications. Availability is one of the key metrics for evaluating a mass storage system. With the goal of improving that availability, this thesis studies Alluxio, an open-source, memory-based virtual distributed storage system, and focuses on availability optimizations for its data management mechanisms, so as to improve the availability, in remote environments, of mass storage systems that combine Alluxio with an underlying storage system.

Drawing on availability techniques from other distributed file systems and in-memory database systems, the thesis analyzes two kinds of data unavailability. In remote environments, underlying data can become inaccessible because of unpredictable factors such as the network. Under asynchronous storage, data can become unavailable because of the asynchronous mechanism itself. Two optimizations are proposed in response. The first is cache prefetching and replacement: data that will be needed is fetched into Alluxio in advance and the volume of hot data held in Alluxio is increased, which eases data transfer pressure during network congestion, reduces the number of accesses to the underlying storage, and extends the time the system can keep serving requests when underlying data is inaccessible. The second optimizes the asynchronous storage process with an operation-aware strategy: when an operation is well defined and idempotent and the underlying layer has the compute resources to execute it, Alluxio sends the command rather than the data to the underlying storage, relieving the network pressure of transferring large volumes of data. Asynchronous and synchronous persistence are also combined to further guarantee the availability of persisted data. These ideas are realized as two strategies: a data prefetching and replacement strategy based on association rules between data blocks, and an operation-aware asynchronous storage optimization strategy, which together largely address the problems identified above.

Finally, the optimizations are evaluated experimentally. The results indicate that the association-rule-based prefetching and replacement strategy can prefetch data in remote scenarios and avoid service unavailability caused by network and similar problems. Because hot data is retained in Alluxio for longer, it also lowers application data-access latency, reduces accesses to the underlying storage, relieves communication pressure under high network load, and lowers the rate of system outages, thereby improving service availability. The asynchronous storage strategy preserves data availability as far as possible under asynchronous operation and reduces network transfer pressure while still meeting requirements such as data integrity and consistency, so that both the performance the application requires and the availability of its data are maintained.
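
The abstract does not give the details of the association-rule prefetching strategy, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes pairwise rules "block a is followed by block b" mined from an access trace with simple support and confidence thresholds, and it uses plain LRU as a stand-in for the replacement policy. The names `mine_block_rules`, `PrefetchingCache`, and the `fetch` callback are hypothetical and are not part of the Alluxio API.

```python
from collections import Counter, OrderedDict, defaultdict


def mine_block_rules(trace, window=1, min_support=2, min_confidence=0.6):
    """Mine pairwise rules a -> b from an ordered list of block ids.

    A pair (a, b) is counted once for every access to a that is followed
    by b within the next `window` accesses. Thresholds are illustrative.
    """
    antecedent_count = Counter()
    pair_count = Counter()
    for i, a in enumerate(trace):
        antecedent_count[a] += 1
        followers = set(trace[i + 1 : i + 1 + window]) - {a}
        for b in followers:
            pair_count[(a, b)] += 1

    rules = defaultdict(list)
    for (a, b), n in pair_count.items():
        confidence = n / antecedent_count[a]
        if n >= min_support and confidence >= min_confidence:
            rules[a].append((b, confidence))
    for a in rules:
        # Highest confidence first; block id breaks ties deterministically.
        rules[a].sort(key=lambda item: (-item[1], item[0]))
    return dict(rules)


class PrefetchingCache:
    """Fixed-capacity block cache that prefetches rule-predicted blocks.

    `fetch` is a hypothetical callable that reads one block from the
    underlying storage. Eviction is plain LRU (an assumption).
    """

    def __init__(self, capacity, rules, fetch):
        self.capacity = capacity
        self.rules = rules
        self.fetch = fetch
        self.blocks = OrderedDict()  # block id -> data, least recent first

    def _admit(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return
        while len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[block_id] = self.fetch(block_id)

    def read(self, block_id):
        """Return (data, hit) and prefetch the predicted followers."""
        hit = block_id in self.blocks
        self._admit(block_id)
        data = self.blocks[block_id]
        for follower, _confidence in self.rules.get(block_id, []):
            self._admit(follower)
        return data, hit


if __name__ == "__main__":
    trace = ["b1", "b2", "b3", "b1", "b2", "b4", "b1", "b2"]
    rules = mine_block_rules(trace)
    cache = PrefetchingCache(2, rules, fetch=lambda b: f"data-{b}")
    print(rules)
    print(cache.read("b1"))
    print(cache.read("b2"))
```

In this sketch a read of a block also pulls its predicted followers into the cache, so a later read of a follower can be served without contacting the underlying storage, which is the effect the abstract attributes to prefetching when the network is congested or the underlying data is unreachable.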
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP333
Article ID: 1769257
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1769257.html