Parallel Mining of Contextual Outliers Based on Relevant Subspaces

Published: 2018-04-25 19:49

Topics: outlier data + contextual information; Source: Taiyuan University of Science and Technology, 2017 master's thesis

[Abstract]: Outlier mining is an important research topic in data mining. Outliers are data objects in a given data set whose characteristics are inconsistent with, and clearly different from, those of most other data. With the explosive growth of data volume and dimensionality, the low efficiency of traditional outlier mining algorithms has become evident, making them ill-suited to massive high-dimensional data sets. In addition, traditional outlier mining generally focuses only on efficiency and accuracy, while the interpretability and comprehensibility of the mining results have received comparatively little study, which makes the detected outliers hard to understand. This thesis uses relevant subspaces to study parallel mining methods for contextual outliers in some depth. The main results are as follows. (1) A contextual outlier mining algorithm under the MapReduce programming model is presented. The algorithm uses the local sparsity difference to determine the relevant subspace of each data object and computes the object's outlier factor in that relevant subspace; the outlier factor, together with the set of relevant attribute dimensions in the relevant subspace, is defined as the object's context information; the N data objects with the largest outlier factors are selected as contextual outliers; and a parallel contextual outlier mining algorithm is given using the MapReduce programming model. Finally, experiments on UCI data sets verify that the context information produced by the algorithm effectively improves the interpretability and comprehensibility of outliers. (2) Using the Spark in-memory computing platform, a parallel contextual outlier mining algorithm based on relevant subspaces is presented. With the help of resilient distributed datasets (RDDs), the algorithm keeps the K-nearest-neighbor sets, the local sparsity matrix, and the local sparsity difference matrix in memory, which effectively improves mining efficiency and reduces I/O cost. Experiments on an astronomical spectral data set verify that the algorithm has good scalability and extensibility on the Spark in-memory computing platform.
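The abstract does not give the formulas for the local sparsity difference or the outlier factor, so the following is only a rough Python sketch of the overall shape it describes: score each object, record the attributes that make it stand out as its context, and select the top N through a map step per partition followed by a reduce step. Everything concrete here is an assumption made for illustration: the function names (`knn`, `context_score`, `map_partition`, `reduce_top_n`, `mine`), the parameters `k` and `threshold`, and above all the scoring rule, which is a simple stand-in (the absolute gap between an object's value and the mean of its k nearest neighbours on each attribute) and not the thesis's definition. Plain functions play the role of the MapReduce or Spark tasks.

```python
import heapq
import math
from itertools import chain
from statistics import mean


def knn(data, i, k):
    """Indices of the k nearest neighbours of object i (Euclidean, all attributes)."""
    dists = [(math.dist(data[i], data[j]), j) for j in range(len(data)) if j != i]
    return [j for _, j in heapq.nsmallest(k, dists)]


def context_score(data, i, k, threshold):
    """Return (outlier_factor, index, relevant_attrs) for object i.

    Stand-in rule (an assumption, not the thesis's formula): attribute a is
    'relevant' when object i differs from its neighbours' mean on a by more
    than `threshold`; the factor is the mean gap over the relevant attributes.
    """
    neighbours = knn(data, i, k)
    gaps = {}
    for a in range(len(data[i])):
        gap = abs(data[i][a] - mean([data[j][a] for j in neighbours]))
        if gap > threshold:
            gaps[a] = gap
    factor = mean(list(gaps.values())) if gaps else 0.0
    return factor, i, sorted(gaps)


def map_partition(data, indices, n, k, threshold):
    """Map step: local top-n records for the objects assigned to one partition."""
    records = (context_score(data, i, k, threshold) for i in indices)
    return heapq.nlargest(n, records, key=lambda r: r[0])


def reduce_top_n(partials, n):
    """Reduce step: merge the per-partition results into the global top-n."""
    return heapq.nlargest(n, chain.from_iterable(partials), key=lambda r: r[0])


def mine(data, n, k=3, threshold=1.0, partitions=2):
    """Split object indices round-robin, run the map step on each, then reduce."""
    splits = [range(p, len(data), partitions) for p in range(partitions)]
    partials = [map_partition(data, s, n, k, threshold) for s in splits]
    return reduce_top_n(partials, n)


if __name__ == "__main__":
    # Toy data: five similar objects and one that deviates only on attribute 2.
    data = [
        (1.0, 1.0, 1.0),
        (1.1, 0.9, 1.0),
        (0.9, 1.1, 1.1),
        (1.0, 1.2, 0.9),
        (1.2, 1.0, 1.0),
        (1.0, 1.0, 9.0),
    ]
    for factor, index, attrs in mine(data, n=2):
        print(f"object {index}: factor={factor:.3f}, relevant attributes={attrs}")
```

Each returned record pairs the score with the list of relevant attributes, which is the sense in which the context is meant to explain why an object was flagged. In this sketch every map task reads the full data set to find neighbours; the abstract indicates that the Spark variant instead keeps the K-nearest-neighbor sets and sparsity matrices in memory as RDDs, which is not modeled here.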
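[Degree-granting institution]: Taiyuan University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13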

[References]