A Survey of Cross-Project Software Defect Prediction Methods

Published: 2018-05-28 03:38

Topics: empirical software engineering + software defect prediction; Source: Chinese Journal of Computers (《计算机学报》), 2018, Issue 01

[Abstract]: Software defect prediction first mines and analyzes software repositories to extract program modules and label each one as defective or non-defective. It then designs metrics that correlate strongly with software defects, based on the inherent complexity of the code or on characteristics of the development process, and measures the extracted modules with them. Finally, it applies a machine learning method to the resulting data to build a defect prediction model. By identifying potentially defective modules early in development, this approach helps optimize the allocation of testing resources. In practice, however, the project that needs defect prediction may be newly started, or its historical training data may be scarce. A simple solution is to build the model from training data already collected for other projects. But projects differ in application domain, development process, programming language, and developer experience, so their datasets often differ substantially in distribution, and this simple solution performs poorly in practice. How to effectively transfer knowledge from source projects in order to build a prediction model for a target project has therefore attracted the attention of researchers worldwide, and the problem is known as cross-project software defect prediction. This paper presents a systematic survey of that problem.
According to the prediction scenario, the paper divides existing methods into three categories: methods based on supervised learning, on unsupervised learning, and on semi-supervised learning. Supervised methods build models mainly from the program modules of a set of candidate source projects. Depending on whether the source and target projects use the same metrics, they are further divided into homogeneous and heterogeneous cross-project defect prediction methods. For the former, research has focused on metric value transformation, instance selection and weighting, feature mapping and feature selection, ensemble learning, and class imbalance learning. The latter is more challenging, and research on it relies mainly on feature mapping and canonical correlation analysis. Unsupervised methods try to predict the program modules of the target project directly. They assume that, in software defect prediction, defective modules tend to have higher metric values than non-defective modules, so research in this category builds mainly on clustering. Semi-supervised methods build models from both the program modules of the candidate source projects and a small number of labeled modules in the target project; they select a few target-project modules for labeling in order to improve cross-project prediction performance. Research in this category relies mainly on ensemble learning and TrAdaBoost.
The paper systematically reviews and comments on the existing work in each category in turn. It then summarizes the performance measures and benchmark datasets commonly used in cross-project defect prediction research; these statistics can help researchers design sound experiments for the problem. Finally, it concludes and outlines future research directions along four dimensions: dataset collection, dataset preprocessing, model construction and evaluation, and model application.
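The assumption behind the unsupervised category (defective modules tend to have higher metric values than non-defective ones) can be illustrated with a minimal sketch. It is not a method taken from the survey: it swaps a full clustering algorithm for a simple median-based split into two groups, and the function name, the metric columns, and the sample values are all hypothetical.

```python
from statistics import median


def label_by_metric_tendency(modules):
    """Flag modules whose metric values tend to be high.

    modules: list of equal-length lists, one row of metric values per
    target-project module. Returns one boolean per module, where True
    means "suspected defective". Hypothetical helper for illustration.
    """
    n_metrics = len(modules[0])
    # Median of each metric over all modules of the target project.
    medians = [median(row[j] for row in modules) for j in range(n_metrics)]
    # For each module, count how many of its metrics exceed that median.
    counts = [
        sum(1 for j in range(n_metrics) if row[j] > medians[j])
        for row in modules
    ]
    # Split the modules into two groups at the median count; the group
    # with more above-median metrics is treated as suspected defective.
    cutoff = median(counts)
    return [c > cutoff for c in counts]


# Made-up metric rows: [lines of code, cyclomatic complexity, change count]
sample = [[10, 2, 1], [200, 15, 9], [35, 4, 2], [400, 30, 12]]
print(label_by_metric_tendency(sample))  # [False, True, False, True]
```

No labeled data from any project is used here, which is what separates this scenario from the supervised and semi-supervised ones. A real study would replace the median split with a proper clustering step and validate the flagged group against known defects.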
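[Affiliations]: School of Computer Science and Technology, Nantong University; State Key Laboratory for Novel Software Technology, Nanjing University; School of Software, Tianjin University
[Funding]: Supported by the National Natural Science Foundation of China (61202006, 61202030, 61373012, 61402244, 61602267), the Open Project of the State Key Laboratory for Novel Software Technology at Nanjing University (KFKT2016B18), and the Natural Science Research Project of Jiangsu Higher Education Institutions (15KJB520030, 16KJB520038)
[Classification No.]: TP311.53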
