Imbalanced Big Data Classification Algorithm Based on Spark
Published: 2023-04-28 19:09
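As the information age advances, data is generated at an ever-increasing rate. To gain an advantage ahead of others or to avert crises early, people have begun to focus on mining hidden information from existing data. Some important information, however, is not contained in the majority-class data; it exists only in the minority class, as in cancer diagnosis and credit fraud. Identifying very rare classes within large datasets has therefore become a focus of current research. On imbalanced datasets, traditional algorithms are strongly biased toward the majority class and struggle to classify the minority class accurately. Yet in practice the minority class is often the more valuable and more representative one. Classification of imbalanced data has been studied extensively in China and abroad, and the approaches fall into two main groups: processing the data itself, and improving the classification algorithm. Data-level methods mainly comprise over-sampling and under-sampling, which respectively expand the minority class or remove part of the majority-class data to balance the amount of data across classes. Algorithm-level improvements mainly include altering the probability density, one-class learning, ensemble algorithms, and kernel methods. Data imbalance shows up in two main ways. The first is between-class imbalance, in which one class has markedly fewer samples than the others but the boundaries between classes are fairly clear. The second is within-class imbalance, in which a class contains several sub-classes that also overlap; this prevents the classifier from effectively distinguishing minority-class noise from minority-class subsets...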
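The two data-level strategies named in the abstract, over-sampling and under-sampling, can be illustrated with a minimal sketch. This is plain random resampling in single-machine Python, not the improved SMOTE or Spark-based method the thesis develops; the function names `random_oversample` and `random_undersample` and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Expand smaller classes by duplicating randomly chosen samples
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Remove randomly chosen samples from larger classes
    until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data (hypothetical): 8 majority-class samples, 2 minority-class samples.
samples = list(range(10))
labels = [0] * 8 + [1] * 2

_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Random over-sampling only repeats existing minority points, whereas SMOTE-style methods synthesize new ones; random under-sampling discards majority data and may lose information. Both are shown here only to make the balancing idea concrete.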
[Pages]: 64
[Degree level]: Master's
[Table of contents]:
Acknowledgements
Abstract
1 Introduction
1.1 Research Background
1.2 Research status at home and abroad
1.2.1 Approach of processing data sets
1.2.2 Algorithm level approach
1.2.3 Computing framework
1.3 Thesis innovation
1.4 Structure of thesis
2 Related Work
2.1 Distributed Computing framework
2.1.1 Introduction to Spark
2.1.2 Resilient Distributed Datasets
2.2 Traditional imbalanced big data processing method and principle
2.2.1 The nature of data imbalance
2.2.2 Method of equalizing data
2.3 Algorithm introduction
2.3.1 SMOTE algorithm
2.3.2 SimHash algorithm
3 Specific algorithm design and implementation
3.1 Data description
3.2 Algorithm improvement and implementation process
3.2.1 SimHash algorithm improvement-dimensionality reduction
3.2.2 Improved SMOTE algorithm
3.2.3 KNN algorithm improvement
3.2.4 Model evaluation criteria
3.2.5 Implementation of KNN Algorithm Based on Hash Technology and Spark
4 Experimental results and analysis
4.1 Data Description
4.2 Experimental environment
4.3 Complete steps of the experimental design
4.4 Algorithm efficiency
4.5 Algorithm accuracy
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
Appendix A Abstract (in Chinese)
Article ID: 3804301
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/3804301.html