Effective Methods for Outlier Detection in Static and Streaming Data
Published: 2024-02-20 04:42
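The accessibility, convenience, and reliability of data are critical, and clean data in any form has become a new kind of wealth in today's society. In many fields, the sheer volume and velocity of data pose major challenges, which makes the ability to maintain high-quality data very important. Data gives businesses in every industry valuable analyses of their activities, helping them reach their full potential and gain an edge over competitors. Businesses are therefore investing heavily in data mining capabilities, hoping to uncover hidden value in different types of data. Outlier detection is an important data mining task whose goal is to detect objects that deviate from the expected pattern of normal data, because outliers are very likely to distort the results of data analysis. It is a significant problem with wide application across domains and data types. Outliers have many potential sources, and identifying them in large datasets requires effective methods. As the digital era advances, outlier detection is becoming increasingly challenging. For example, with the shift away from traditional batch data, large amounts of data are now generated continuously in a fast, dynamic manner. Such data may contain redundant information, which often degrades the efficiency and overall performance of outlier detection methods. Over the years, methods and techniques built on different algorithms have been proposed to address these challenges. Some common difficulties concern the nature of the input data, the type of outlier, data labels, accuracy, and CPU time and memory consumption...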
[Pages]: 163
[Degree level]: Doctoral
[Table of contents]:
Abstract (in Chinese)
Abstract
Abbreviations
Chapter 1 Introduction
1.1 Motivation
1.2 Fundamental Concepts
1.2.1 The Definition of Outliers
1.2.2 Static Data
1.2.3 Streaming Data
1.2.4 Causes of Outliers, Identification Process and Handling Process
1.2.5 Application Areas of Outlier Detection
1.3 Research Goals and Contributions
1.4 Main Contents and Technological Route
Chapter 2 Related Work-Progress in Outlier Detection Techniques
2.1 Outlier Detection Methods
2.2 Statistical-Based Approaches
2.2.1 Parametric Methods
2.2.2 Non-Parametric Methods
2.2.3 Advantages, Disadvantages and Challenges
2.3 Distance-Based Approaches
2.3.1 K-Nearest Neighbor Methods
2.3.2 Pruning Methods
2.3.3 Data Stream Methods
2.3.4 Advantages, Disadvantages and Challenges
2.4 Clustering-Based Approaches
2.4.1 Partitioning and Hierarchical Clustering Methods
2.4.2 Density-based and Grid-based Clustering Methods
2.4.3 Advantages, Disadvantages and Challenges
2.5 Chapter Summary
Chapter 3 Parametric and Non-Parametric Approach for High-Accurate Outlier Detection in Static Data
3.1 Introduction
3.2 Parametric Approach
3.2.1 Gaussian Mixture Model for Outlier Detection (GMMOD)
3.2.2 Learning Model and Algorithms
3.2.3 The GMMOD Algorithm
3.3 Non-Parametric Approach
3.3.1 Kernel Density Estimation for Outlier Detection (KDEOD)
3.3.2 Bandwidth Selection
3.3.3 The KDEOD Algorithm
3.4 Experimental Evaluation
3.4.1 Experimental Setup
3.4.2 Data Description
3.4.3 Performance Evaluation
3.4.4 Experimental Results
3.4.5 Discussion
3.5 Chapter Summary
Chapter 4 An Effective Minimal Probing Approach Distance-Based Outlier Detection in Data Streams
4.1 Introduction
4.2 Definition of Key Terms
4.3 Problem Formulation
4.4 Methodology
4.4.1 Micro-Cluster with Minimal Probing
4.4.2 Data Points Within the Current Window
4.4.3 Processing the New Data Points and New Slide
4.4.4 Processing the Expired Window and Slide
4.4.5 Processing and Reporting Outliers
4.5 Experiments and Results
4.5.1 Varying Window Size, W
4.5.2 Varying the Nearest Neighbor Count, K
4.5.3 Varying the Distance Threshold, R
4.5.4 Complexity Analysis
4.5.5 The Advantage and Disadvantages of the Proposed Method
4.6 Chapter Summary
Chapter 5 CLODS: An Effective Clustering-Based Technique for Detecting Outliers in Data Streams
5.1 Introduction
5.2 Preliminaries and Problem Statement
5.3 Methodology
5.3.1 Fundamentals of the Proposed Method
5.3.2 The Proposed Framework
5.3.3 The Data Stream Stage
5.3.4 Data Preprocessing Stage
5.3.5 Sliding Window Based Outlier Detection Stage
5.3.6 The Clustering Process Stage
5.3.7 The Outlier Detection Stage
5.4 Experimental Setup and Results
5.4.1 Experimental Setup
5.4.2 Results and Discussions
5.4.2.1 CPU Time
5.4.2.2 Memory Usage
5.4.2.3 Space and Time Complexity
5.4.2.4 Data Points in Cluster
5.5 Chapter Summary
Conclusions and Future Work
References
List of Publications
Acknowledgements
Resume
Article ID: 3903857
Article link: https://www.wllwen.com/kejilunwen/shengwushengchang/3903857.html