Research on Software Development Behavior on Open Source Platforms Based on Supervised Learning

Published: 2019-04-10 06:35

[Abstract]: Since the end of the twentieth century, the rapid growth of open source software has increasingly challenged a software industry long dominated by proprietary software, and the growing number of open source projects has had a substantial effect on the industry's market structure. Distributed development models have evolved alongside the changing needs of open source development, and the pull-based distributed development model has opened a new direction for distributed software development. The study of developer behavior in open source development is an active topic in software evolution research: it helps developers understand the regularities of software evolution more deeply and thereby improve existing development processes. As more developers take part in open source development, code hosting platforms such as GitHub and BitBucket have begun to support distributed software development. Analyzing development behavior on GitHub requires processing massive, loosely related data, and extracting deeper value from that data often calls for sophisticated analysis, including machine learning.

This thesis analyzes open source projects hosted on GitHub that use the pull-based development model and uncovers patterns in workflow turnaround, the acceptance of external contributions, and the time taken to process them. It analyzes developers' actions and builds a prediction model according to how strongly different behavioral features influence whether a contribution is ultimately accepted, in order to predict whether an external contribution will be accepted. During feature extraction, behavioral features derived from historical records are added, which effectively supplements the feature set needed to build the prediction model. The prediction model addresses the problem of classifying the final state of a pull request, and it uses a supervised learning algorithm suited to large-scale data, the support vector machine (SVM), to perform the classification. The thesis compares the performance of the candidate prediction models and studies how to choose a suitable one. It also addresses shortcomings of existing SVM approaches, namely the excessive computation required for kernel parameter optimization and insufficient learning performance and recognition rate. Finally, it discusses how well the prediction model fits the data.

The novel contributions of this thesis are as follows:

1. It studies the acceptance strategy for pull requests in open source systems. Common machine learning classifiers are used to select features from large-scale GitHub data and to classify that data. Taking testing and history-based behavioral features into account, the feature set is extended with test coverage, the contributor's historical rate of successfully submitted requests, and the project's historical rate of accepted requests.

2. To make grid search more efficient, the exhaustive mode of the grid search algorithm is improved and applied to building the prediction model. The result is a grid detection parameter selection algorithm (GDPS) that combines pattern search with grid search. GDPS selects the optimal parameter pair for the SVM kernel function used in the prediction model, improving the SVM's learning performance and recognition rate and yielding a prediction model with higher accuracy.
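
The abstract describes GDPS only as a combination of grid search and pattern search for choosing the SVM kernel parameter pair, without algorithmic detail. The Python sketch below illustrates one generic way such a combination could work; it is not the thesis's algorithm. It assumes the parameter pair is (log2 C, log2 gamma), that a coarse grid is searched first, and that a compass-style pattern search then refines the best grid point by halving its step. All function names, the step rule, and `toy_score` (a stand-in for cross-validated accuracy) are hypothetical.

```python
import itertools
import math


def coarse_grid_search(score, log2_c_values, log2_gamma_values):
    """Evaluate score at every coarse grid point; return (best_point, best_score)."""
    best_point, best_score = None, -math.inf
    for point in itertools.product(log2_c_values, log2_gamma_values):
        s = score(*point)
        if s > best_score:
            best_point, best_score = point, s
    return best_point, best_score


def pattern_search(score, start, start_score, step, min_step=0.125):
    """Compass-style pattern search around `start`.

    Probes +/- step along each axis, moves to the best neighbour if it
    strictly improves the score, and halves the step otherwise.
    """
    point, best = start, start_score
    while step >= min_step:
        x, y = point
        neighbours = [(x + step, y), (x - step, y), (x, y + step), (x, y - step)]
        scored = [(score(*n), n) for n in neighbours]
        cand_score, cand = max(scored, key=lambda t: t[0])
        if cand_score > best:
            point, best = cand, cand_score
        else:
            step /= 2.0
    return point, best


def grid_then_pattern_search(score, log2_c_values, log2_gamma_values):
    """Coarse grid search followed by pattern-search refinement.

    Assumes both value lists are uniformly spaced with the same spacing.
    """
    coarse_step = log2_c_values[1] - log2_c_values[0]
    start, start_score = coarse_grid_search(score, log2_c_values, log2_gamma_values)
    return pattern_search(score, start, start_score, step=coarse_step / 2.0)


def toy_score(log2_c, log2_gamma):
    """Hypothetical smooth objective standing in for cross-validated accuracy."""
    return 1.0 - 0.01 * ((log2_c - 3.25) ** 2 + (log2_gamma + 5.5) ** 2)


if __name__ == "__main__":
    c_grid = list(range(-5, 16, 4))       # coarse grid over log2(C)
    gamma_grid = list(range(-15, 4, 4))   # coarse grid over log2(gamma)
    print(grid_then_pattern_search(toy_score, c_grid, gamma_grid))
```

In a real experiment, `score` would be replaced by something like the cross-validated accuracy of an RBF-kernel SVM trained with `C = 2 ** log2_c` and `gamma = 2 ** log2_gamma`. The intent of the two-stage design is to avoid exhaustively evaluating a fine grid: the coarse pass locates a promising region, and the local search spends evaluations only near it.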
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.52

[References]