Research on an Adaptive Failure Prediction Algorithm for Supercomputers

Published: 2018-06-21 20:52

Topic: system fault tolerance + supercomputer; Source: Chongqing University, master's thesis, 2014

[Abstract]: With the development of information technology, cloud computing and other large-scale distributed systems have been widely deployed. As the software and hardware of these systems grow more complex, keeping a system running correctly over long periods while providing high-quality service to users has become a concern in the design and development of large systems. If a large system can diagnose itself through a failure prediction strategy, its fault tolerance and resource scheduling can improve substantially, which supports high availability and high reliability. Supercomputers are complex computer systems, so research on failure prediction for supercomputers matters for improving their computing performance and fault tolerance, and an effective failure prediction strategy can also be applied to other large systems to improve their fault tolerance.
Working from supercomputer system logs, this thesis first designs and implements a filtering algorithm based on semantic and temporal correlation, the Semantic Time Filter Algorithm (STF), to preprocess the log records. STF considers the semantic correlation and the temporal correlation between log records and filters redundant records out of the raw log according to these two measures. Experiments show that the filtered log sequence effectively reflects how non-failure events in the system evolve into failure events, which greatly helps the subsequent analysis and the construction of a failure prediction model. Based on an analysis of the filtered records, the thesis applies the classification-based prediction approach from data mining: the time axis is divided into time windows of a given size, features are extracted from each window, and failures are predicted per window.
The thesis uses the AdaBoost algorithm during the training of an SVM classifier to adjust the classifier's core parameters dynamically according to the training set, so that the classifier improves through adaptive learning; the result is the adaptive failure prediction model AdaBoostSVM. The experimental data set is 215 days of system logs from the BlueGene/L supercomputer; after preprocessing, the prediction models are compared on this data set. The results show that AdaBoostSVM achieves better classification and prediction performance than failure prediction models based on Time Between Failure (TBF), kNN, RIPPER, and SVM. In particular, on recall, an important metric in failure prediction, AdaBoostSVM is 10%-20% higher than the other models.
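The abstract does not define how STF measures semantic or temporal correlation, nor which features are extracted per time window. The following is only a minimal sketch of the general preprocessing idea under stated assumptions: token-set Jaccard similarity stands in for semantic correlation, a fixed time gap stands in for temporal correlation, and the per-window features are simple severity counts. The names `LogRecord`, `jaccard`, `filter_redundant`, and `window_features`, along with all thresholds and the "FATAL" label, are hypothetical and not taken from the thesis.

```python
from collections import Counter
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: float  # seconds since an arbitrary epoch (assumed format)
    severity: str     # e.g. "INFO" or "FATAL" (hypothetical labels)
    message: str


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; an assumed stand-in for semantic correlation."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_redundant(records, max_gap=300.0, min_similarity=0.8):
    """Drop a record if a kept record within max_gap seconds is similar enough."""
    kept = []
    for rec in sorted(records, key=lambda r: r.timestamp):
        redundant = False
        # Scan kept records from the most recent backwards; stop once too old.
        for prev in reversed(kept):
            if rec.timestamp - prev.timestamp > max_gap:
                break
            if jaccard(rec.message, prev.message) >= min_similarity:
                redundant = True
                break
        if not redundant:
            kept.append(rec)
    return kept


def window_features(records, window=3600.0, failure_severity="FATAL"):
    """Return (window index, severity counts, contains_failure) per non-empty window."""
    windows = {}
    for rec in records:
        idx = int(rec.timestamp // window)
        windows.setdefault(idx, Counter())[rec.severity] += 1
    return [
        (idx, dict(counts), counts[failure_severity] > 0)
        for idx, counts in sorted(windows.items())
    ]


# Made-up records, for illustration only.
logs = [
    LogRecord(0.0, "INFO", "node card temperature warning"),
    LogRecord(10.0, "INFO", "node card temperature warning"),
    LogRecord(4000.0, "FATAL", "kernel panic on node 12"),
]
kept = filter_redundant(logs)
features = window_features(kept)
```

In this sketch the flag marks whether a window itself contains a failure record. A predictor would more plausibly pair the features of one window with the label of a later window, and the resulting vectors would then be passed to a classifier such as the boosted SVM the abstract describes.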
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP338
