Research on Software Fault Prediction Methods Based on Transfer Learning and PU Learning

Published: 2019-01-03 14:10

[Abstract]: With the continued development of artificial intelligence, machine learning has been applied to software fault prediction. Traditional machine-learning-based software fault prediction needs a large number of labeled samples to build a model. In practice, however, labeled software fault data is usually obtained through manual testing, which is time-consuming, labor-intensive, and costly. To reduce the dependence of traditional fault prediction methods on labeled samples in supervised learning settings, this thesis studies two directions, positive and unlabeled learning (PU learning) and transfer learning. It proposes predicting target fault samples in the PU setting by transferring knowledge from cross-company, cross-project positive and unlabeled fault data. The specific work is as follows.

(1) An instance transfer algorithm based on random forests in the PU setting (the POSTRF algorithm). Working in the PU setting and building on the idea of Bayesian cross-class transfer, the algorithm treats the samples to be predicted as the target-domain dataset and the cross-company, cross-project software fault samples as the source-domain dataset. It draws samples with replacement from the source-domain dataset to train multiple PU random decision trees, then computes sample weights from the AUC values obtained by testing on target-domain data and from the samples in each sampling set. The source samples whose distribution is similar to that of the target-domain data are transferred and combined with the target-domain data into a PU dataset, and a model based on the POSC4.5 algorithm is built to predict software fault samples in the target domain.

In detail, the algorithm first samples the source-domain dataset with replacement at the ratio bagSize to obtain M sampling sets and trains M PU random decision trees. It randomly draws 75% of the target-domain samples as a test set and evaluates the M random decision trees on it. The AUC (Area Under the ROC Curve) of each tree is taken as that tree's weight, the samples in each sampling set are weighted by the tree weight, and the per-set sample weights are merged into a final sample weight. The samples with the highest weights are then transferred at the transfer ratio r, which completes the instance transfer. A PU dataset is constructed from the transferred samples and the target-domain dataset under the completely random assumption. The uncertain information gain of each attribute is computed from the number of positive samples, the number of unlabeled samples, and the prior probability of the positive class. The attribute with the largest uncertain information gain is chosen as the branch node, and the tree model is generated recursively from the top down to predict fault samples in the target domain.

(2) Experiments on the POSTRF algorithm. Eight software fault datasets from the NASA database serve as the experimental data. The 0kc3 and cm1 datasets are each used in turn as the target-domain dataset, with the remaining datasets as source-domain datasets, and the proposed algorithm is compared with the POSC4.5 algorithm. The results show that, on the 0kc3 and cm1 target sets, POSTRF improves the model's classification performance by transferring instances from the other auxiliary sets: AUC increases by about 3%-12% and the fault prediction rate PD by about 5%. By transferring knowledge from cross-project, cross-company software fault data, the proposed POSTRF algorithm therefore achieves prediction performance on target-domain fault samples that is comparable to or better than that of traditional PU learning algorithms.
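The sampling, AUC weighting, and instance transfer step of item (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no code, so every function name here (`auc`, `train_stump`, `transfer_instances`) is hypothetical, and a one-feature scorer stands in for a PU random decision tree. The sketch also assumes that label 1 marks a labeled positive (faulty) sample and 0 an unlabeled one, and that a source sample gains one tree's AUC for each sampling set that contains it, since the abstract does not spell out how the weights are merged.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a non-positive (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # AUC is undefined with one class; fall back to chance level
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def train_stump(X, y, rng):
    """Hypothetical stand-in for one PU random decision tree: a one-feature scorer."""
    f = rng.randrange(len(X[0]))  # pick a random attribute
    pos = [x[f] for x, lab in zip(X, y) if lab == 1]
    unl = [x[f] for x, lab in zip(X, y) if lab == 0]
    mean = lambda v: sum(v) / len(v) if v else 0.0
    # Orient the score so that labeled positives tend to score higher.
    sign = 1.0 if mean(pos) >= mean(unl) else -1.0
    return lambda x: sign * x[f]

def transfer_instances(Xs, ys, Xt, yt, M=10, bag_size=0.5, r=0.3, seed=0):
    """Return (indices of transferred source samples, per-sample weights)."""
    rng = random.Random(seed)
    n_src = len(Xs)
    # Randomly draw 75% of the target-domain samples as the test set.
    test_idx = rng.sample(range(len(Xt)), max(1, int(0.75 * len(Xt))))
    weights = [0.0] * n_src
    for _ in range(M):
        # Sample the source domain with replacement at ratio bag_size.
        bag = [rng.randrange(n_src) for _ in range(max(1, int(bag_size * n_src)))]
        model = train_stump([Xs[i] for i in bag], [ys[i] for i in bag], rng)
        # The AUC on the target test set serves as this model's weight.
        tree_auc = auc([model(Xt[i]) for i in test_idx], [yt[i] for i in test_idx])
        # Assumed merge rule: each sample in the bag accumulates the model's AUC once.
        for i in set(bag):
            weights[i] += tree_auc
    # Transfer the fraction r of source samples with the highest merged weight.
    k = max(1, int(r * n_src))
    ranked = sorted(range(n_src), key=lambda i: weights[i], reverse=True)
    return ranked[:k], weights
```

Calling `transfer_instances(Xs, ys, Xt, yt, M=5, bag_size=0.5, r=0.25)` on lists of feature vectors and 0/1 labels returns the selected source indices together with the merged weights. In the method described above, the selected samples would then be pooled with the target-domain data to form the PU dataset on which the POSC4.5-based model is trained.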
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.53

[References]