Mining Software Repositories for Bug Localization: A Comparative Analysis of the Revised Vector Space Model and Pretrained Word Embeddings
Published: 2023-06-04 05:07
The field of mining software repositories analyzes the data stored in software repositories in order to improve the software development process. Although a wealth of data exists in version control systems, bug tracking systems, communication archives, design requirements, and documentation, its highly unstructured nature still poses challenges for researchers who want to analyze it. One of the tasks that mining software repositories attempts to solve is bug localization. Locating bugs in source code is difficult: manual bug localization is known to be tedious and laborious, and developers spend a great deal of time on it. The goal of bug localization is to automatically identify defective source code files based on bug reports. Despite the large number of automated techniques, the field has not yet realized its full potential or been commercialized, so automatic bug localization remains an open problem that attracts considerable interest from the research community. With recent advances in natural language processing, many models for embedding words into vectors have been proposed. They rest on the distributional hypothesis: words with similar meanings lie close together in the vector space, so the semantic similarity of two words can be measured by the distance between their vector representations. This thesis combines information retrieval models with pretrained word embedding models and investigates their effectiveness for bug localization. Using different preprocessing techniques, the proposed model is evaluated by its ability to retrieve a ranked list of source code files relevant to the analyzed bug report. Bug localization can handle data of an unstructured nature, such as bug reports, ...
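The two ideas above, measuring word similarity as a distance between embedding vectors and retrieving a ranked list of source files for a bug report, can be sketched in a few lines. This is a minimal toy illustration, not the thesis's model: the 3-dimensional vectors and file names below are hypothetical stand-ins (real pretrained embeddings such as word2vec, GloVe, or fastText have hundreds of dimensions), and documents are represented by the simple average of their word vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-dimensional "embeddings" with made-up values for illustration only.
embeddings = {
    "bug":   [0.9, 0.1, 0.0],
    "error": [0.8, 0.2, 0.1],
    "crash": [0.7, 0.3, 0.0],
    "tree":  [0.0, 0.9, 0.4],
    "parse": [0.1, 0.8, 0.3],
}

def embed_document(words):
    """Represent a document as the average of its word vectors."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Rank hypothetical "source files" (bags of words) against a bug-report query.
files = {
    "Parser.java":  ["tree", "parse"],
    "Handler.java": ["error", "crash"],
}
query = embed_document(["bug", "crash"])
ranking = sorted(files, reverse=True,
                 key=lambda f: cosine_similarity(query, embed_document(files[f])))
print(ranking)  # → ['Handler.java', 'Parser.java']
```

Semantically related words ("bug", "error") score near 1.0 while unrelated ones ("bug", "tree") score near 0, and the file whose vocabulary is closer to the bug report ranks first.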
Pages: 71
Degree level: Master's
Table of contents:
Abstract
摘要
Chapter 1 Introduction
1.1 Introduction to Mining Software Repositories
1.2 Background research and objectives
1.2.1 Research Objectives and contribution of the thesis
1.2.2 Background research
1.2.3 Motivation
1.3 Literature Review and Analysis
Chapter 2 Theoretical Background
2.1 Version Control Systems
2.1.1 SourceForge
2.1.2 GitHub
2.2 Bug tracking systems
2.2.1 Bugzilla
2.3 Information retrieval
2.3.1 Common terminology
2.4 Commonly used IR models
2.4.1 Vector Space Model (VSM)
2.4.2 Revised Vector Space Model (rVSM)
2.4.3 Latent Semantic Indexing (LSI)
2.4.4 Probabilistic Latent Semantic Indexing (PLSI)
2.4.5 Latent Dirichlet Allocation
2.5 Word embeddings
2.5.1 Vector space model and statistical language model
2.5.2 Representing text with embeddings
2.5.3 Types of word embeddings
2.6 Abstract Syntax Trees
2.7 Summary
Chapter 3 Bridging the Lexical Gap
3.1 Pretrained Word Embedding Models
3.1.1 word2vec model trained on Stack Overflow posts
3.1.2 fastText model trained on Common Crawl
3.1.3 GloVe model trained on Common Crawl
3.1.4 fastText model trained on source code files
3.2 Types of similarity
3.2.1 Lexical similarity
3.2.2 Semantic similarity
3.3 Similarity measures
3.3.1 Cosine similarity
3.3.2 Word Mover's Distance
3.4 Objective Function and Optimization
3.4.1 Differential evolution
3.5 Structure of the model
3.6 Summary
Chapter 4 Experimental Setup and Results
4.1 Data collection
4.2 Parsing and preprocessing
4.2.1 Tokenization and linguistic preprocessing of tokens
4.3 Experiments with different preprocessing techniques
4.3.1 Embedding whole content of source files
4.3.2 Parsing ASTs of source code files
4.4 Experiments with different pretrained vectors
4.5 Evaluation
4.6 Results
4.6.1 fastText vectors trained on Common Crawl data
4.6.2 GloVe vectors trained on Common Crawl data
4.6.3 word2vec vectors trained on Stack Overflow data
4.7 Comparison with other models
4.7.1 Comparison with the base rVSM model
4.7.2 Comparison of the proposed model with BugLocator
4.8 Summary
Conclusion
References
Acknowledgements
Resume
Article ID: 3830746
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/3830746.html