Supervised Learning-Based Summarization of Bug Reports and Source Code

Published: 2019-06-06 09:28

[Abstract]: When performing software tasks, developers interact with software artifacts such as bug reports and source code repositories, and obtaining the information they need may require reading an entire artifact from start to finish. Extracting valuable information from bug reports and source code is therefore tedious and time-consuming. To make this task more efficient, researchers have proposed automatically generating summaries of software artifacts. In this thesis, to help developers efficiently extract the information they need from bug reports and source code repositories, we propose using supervised learning techniques to generate summaries. We summarize bug reports with the help of their duplicate bug reports, as an instance of natural language text summarization. In a separate study, we summarize source code fragments, as an instance of source-to-source summarization.

For bug reports, we develop a PageRank-based bug report summarization algorithm, the PageRank-based Summarization Technique (PRST). The algorithm uses three similarity measures, based on the vector space model (VSM), Jaccard, and WordNet respectively, to compute the similarity between a master bug report and its duplicate bug reports. Because publicly available bug report corpora do not record the correspondence between master and duplicate bug reports, the information contained in duplicate bug reports cannot be used for summarization with those corpora. We therefore extracted 59 bug reports from the Mozilla, KDE, Gnome, and Eclipse projects and built an independent bug report corpus called OSCAR. We also reconstructed the existing BRC corpus by adding duplicate bug reports and used it as a comparison corpus. We evaluate the effectiveness of the proposed algorithm extrinsically using several statistical evaluation metrics: Precision, Recall, F-Score, and Pyramid Precision. The results show that the proposed algorithm produces relatively accurate bug report summaries and improves on the precision of existing supervised bug report summarization approaches.

Similarly, to summarize source code, we develop a code fragment summarization algorithm, CodeFragment Summarization (CFS), based on support vector machine (SVM) and naive Bayes (NB) classifiers, which automatically generates source-to-source summaries of code fragments. Within the software artifact summarization paradigm, we are the first to introduce a small-scale, data-driven crowdsourcing approach to help extract syntactic features of source code. We retrieved 127 code fragments from the official Eclipse and NetBeans FAQs and built a code fragment corpus for testing. We use the same evaluation metrics and compare against existing approaches to validate the proposed method. The results show that our code fragment summarizer outperforms existing code fragment summarization approaches in precision, and that syntactic features have an important effect on the accuracy of the generated summaries. The generated summaries can effectively help developers complete the software tasks at hand and improve software performance and quality.
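To make the idea of ranking sentences with PageRank over a similarity graph more concrete, the following is a minimal sketch and not the thesis's PRST implementation. It assumes Jaccard similarity over word sets as the only edge weight (the abstract also names VSM and WordNet measures, which are omitted here), and it assumes that only sentences of the master report are candidates for the summary while duplicate-report sentences merely contribute weight to the graph. The function names `tokenize`, `jaccard`, `pagerank`, and `summarize`, the damping factor, and the iteration count are all illustrative choices.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph given as a square matrix.

    Nodes without outgoing weight pass no score on (a simplification).
    """
    n = len(weights)
    if n == 0:
        return []
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(master, duplicates, k=2):
    """Return the k highest-ranked master sentences in their original order.

    Assumption: duplicate sentences only add edges to the graph; they are
    never selected for the summary themselves.
    """
    sentences = master + duplicates
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [jaccard(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(len(master)), key=lambda i: scores[i], reverse=True)[:k]
    return [master[i] for i in sorted(top)]


if __name__ == "__main__":
    master = [
        "The browser crashes when opening a PDF attachment.",
        "I use Linux.",
        "Crash happens after clicking the PDF link in mail.",
    ]
    duplicates = [
        "Opening a PDF attachment crashes the browser.",
        "Browser crash on PDF link click.",
    ]
    print(summarize(master, duplicates, k=2))
```

In this sketch a master sentence that is echoed by a duplicate report gains a heavy edge and therefore a higher score, which is one plausible reading of why duplicate reports help; how the thesis actually combines its three similarity measures is not stated in the abstract.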
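The Precision, Recall, and F-Score metrics named in the abstract can be illustrated for an extractive summary as follows. This is a sketch under the assumption that both the generated summary and the gold-standard summary are represented as sets of sentence indices; the function name `precision_recall_f` is hypothetical, and Pyramid Precision is not covered because the abstract does not define how it is computed.

```python
def precision_recall_f(selected, gold):
    """Compare selected sentence indices against gold-standard indices.

    precision = |selected & gold| / |selected|
    recall    = |selected & gold| / |gold|
    f_score   = harmonic mean of precision and recall
    """
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Summary picked sentences 0, 2, 5; annotators marked 0, 1, 2, 3.
    p, r, f = precision_recall_f([0, 2, 5], [0, 1, 2, 3])
    print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```

With three selected sentences of which two are in a four-sentence gold summary, precision is 2/3, recall is 1/2, and the F-Score is 4/7.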
[Degree-Granting Institution]: Dalian University of Technology
[Degree Level]: Doctoral
[Year Degree Granted]: 2016
[Classification Number]: TP311.5;TP391.1
