Research on Outlier Detection Algorithms for Uncertain Data Oriented to Network Fraud Discovery
Published: 2019-05-27 00:03
[Abstract]: With the rapid development of the Internet, people's daily lives have become inseparable from the network. At the same time, frequent online fraud has become a major factor disrupting normal online life. Outlier detection is an important data mining technique and an important means of anomaly detection, and distance-based outlier detection is currently one of the most commonly used outlier detection techniques. This thesis studies outlier detection algorithms on uncertain data for the discovery of online fraud. Online fraud mostly occurs during online transactions and is accompanied by abnormal transaction behavior. The thesis treats each user's online transaction behavior as a data object and maps it into a multidimensional space, with each attribute of the transaction behavior serving as one dimension of that space. An abnormal transaction typically appears as one of a few data points that deviate from the majority of data objects, so detecting such data amounts to outlier detection in the multidimensional space. In addition, because of incomplete data, noise, operator error, and other causes, online transaction data are often uncertain. The thesis therefore studies distance-based outlier detection algorithms on uncertain data sets, aiming to detect uncertain outliers efficiently and reasonably and thereby help discover abnormal online transactions and online fraud.
The thesis first describes uncertain data sets using the x-tuple model and possible-world semantics. Each uncertain data object is represented as an x-tuple, each of its possible data instances is represented as a tuple, and a set of tuples drawn from different x-tuples forms a possible world. A possible world is one instance of the uncertain data set. The thesis then treats outlier detection on an uncertain data set as a query process and, for different data characteristics, proposes four new concepts: expected outlier detection, semi-expected outlier detection, full-probability outlier detection, and relative outlier detection.
Expected outlier detection is the simplest of these concepts. It computes an expected outlier degree for every tuple and every x-tuple and queries the whole data set for the K x-tuples with the highest expected outlier degree. Semi-expected outlier detection improves on expected outlier detection by addressing the latter's susceptibility to data incompleteness. It computes only the expected outlier degree of each tuple and no longer computes that of each x-tuple, hence the name semi-expected outlier degree. Relative outlier detection addresses the sensitivity of the previous two concepts to bursty data and noise. It no longer computes expected outlier degrees for tuples and x-tuples; instead, it finds the K x-tuples most likely to be outliers by comparing x-tuples pairwise. This method also avoids having to set some parameter thresholds, which lowers the barrier to applying outlier detection and makes it particularly suitable for ordinary users who are not experts in a specific application domain. Finally, the thesis proposes the concept of full-probability outlier detection. Borrowing the idea of global top-K queries on uncertain data sets, it computes the probability that each x-tuple is a top-k1 outlier in any possible world; the k2 x-tuples with the highest probability are the outliers of the uncertain data set.
The thesis gives formal definitions of these four kinds of outliers on uncertain data, proposes algorithm frameworks, designs pruning strategies on that basis to obtain efficient optimized algorithms, and finally verifies the accuracy and efficiency of the algorithms, the effectiveness of the pruning strategies, and the scalability of the algorithms through experiments on real and synthetic data sets.
Existing research on distance-based outlier detection on uncertain data sets often has two shortcomings. First, it assumes that the uncertain data follow a known distribution, in particular one such as the normal distribution whose probability density function has a closed-form expression. This assumption is often hard to satisfy in practice, which limits the applicability of that research. Second, some studies, although they also use the x-tuple model and possible-world semantics to describe uncertain data sets, ignore data diversity: an uncertain data object is not represented as multiple possible instances. The new outlier detection concepts proposed in this thesis apply to any probability distribution, account for both data incompleteness and data diversity, and can perform outlier detection efficiently and reasonably.
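As a rough illustration of the possible-world semantics, the expected outlier degree, and the full-probability ranking described in the abstract, the following Python sketch enumerates every possible world of a tiny uncertain data set by brute force. The toy data, the function names, and the choice of outlier degree (distance to the k-th nearest neighbour) are all assumptions made for illustration. The abstract does not give the thesis's exact definitions, and the thesis relies on pruning rather than exhaustive enumeration, which is exponential in the number of x-tuples.

```python
from itertools import product
from math import dist

# Hypothetical toy data: each x-tuple maps to a list of (point, probability)
# alternatives (its tuples). Probabilities within one x-tuple sum to at most 1;
# any shortfall models incompleteness (the object may be absent from a world).
XTUPLES = {
    "u1": [((1.0, 1.0), 0.6), ((1.2, 0.9), 0.4)],
    "u2": [((1.1, 1.3), 1.0)],
    "u3": [((0.9, 1.1), 0.5), ((1.0, 1.2), 0.5)],
    "u4": [((8.0, 8.0), 0.7), ((1.0, 1.0), 0.3)],
}


def possible_worlds(xtuples):
    """Yield (world, probability); a world maps x-tuple id -> chosen point."""
    ids = list(xtuples)
    options = []
    for xid in ids:
        alts = list(xtuples[xid])
        missing = 1.0 - sum(p for _, p in alts)
        if missing > 1e-12:  # leftover mass: x-tuple absent from the world
            alts.append((None, missing))
        options.append(alts)
    for combo in product(*options):
        prob = 1.0
        world = {}
        for xid, (point, p) in zip(ids, combo):
            prob *= p
            if point is not None:
                world[xid] = point
        yield world, prob


def knn_score(xid, world, k):
    """Assumed outlier degree: distance to the k-th nearest neighbour."""
    dists = sorted(dist(world[xid], q) for o, q in world.items() if o != xid)
    return dists[k - 1] if len(dists) >= k else 0.0


def expected_scores(xtuples, k=1):
    """Probability-weighted outlier degree of each x-tuple over all worlds.

    An x-tuple contributes nothing in worlds where it is absent.
    """
    scores = dict.fromkeys(xtuples, 0.0)
    for world, prob in possible_worlds(xtuples):
        for xid in world:
            scores[xid] += prob * knn_score(xid, world, k)
    return scores


def topk_probabilities(xtuples, k=1, k1=1):
    """Probability that each x-tuple ranks among the top-k1 outliers."""
    probs = dict.fromkeys(xtuples, 0.0)
    for world, prob in possible_worlds(xtuples):
        ranked = sorted(world, key=lambda x: knn_score(x, world, k), reverse=True)
        for xid in ranked[:k1]:
            probs[xid] += prob
    return probs


if __name__ == "__main__":
    print(expected_scores(XTUPLES, k=1))
    print(topk_probabilities(XTUPLES, k=1, k1=1))
```

In this toy set, `u4` lies far from the others in one of its two alternatives, so it dominates both rankings. Taking the `k2` largest entries of `topk_probabilities` would correspond to the final selection step the abstract describes for full-probability outlier detection.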
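[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP311.13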
Article ID: 2485740
Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2485740.html