Research on a Density-Biased Sampling Algorithm for Big Data and Its Applications

Published: 2019-05-19 14:07

[Abstract]: With the emergence of the concept of "big data", data mining has become a research hotspot in the field. Because mining big data consumes substantial computing and storage resources, improving the efficiency of processing very large data sets has become the key to addressing this problem. In cluster analysis, there are currently two main ways to improve the efficiency of data mining: improving classical clustering algorithms, and using sampling to reduce the size of the original data set. In the big data setting, data grows far faster than algorithms can be improved and updated, so sampling is particularly important in cluster analysis. When traditional sampling techniques are applied to data sets that are highly skewed or whose distribution is unknown, they produce poor sampling results, unrepresentative samples, and lost clusters; density-biased sampling can effectively address these problems. This thesis uses density-biased sampling to study unevenly distributed data sets and explores sampling algorithms suited to such data. In recent years, research on density-biased sampling has focused on how to use the information in the original data set to partition a grid space that is consistent with the distribution of the data set. To address the high time cost of constructing a variable grid, this thesis improves an existing variable grid partitioning method. First, the method dynamically determines the partition granularity of each dimension from the mean of the original data in that dimension. Second, it adjusts the intervals according to the similarity of interval densities, constructing a variable grid space consistent with the distribution of the original data set. Finally, it combines this grid space with density-biased sampling to design an optimized density-biased sampling algorithm that builds its variable grid from mean information. Verification and analysis of the algorithm show that, on large-scale, unevenly distributed data sets, it avoids losing clusters, effectively improves sample quality, shortens sampling time, and has a certain advantage in execution efficiency.
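The general idea of grid-based density-biased sampling can be illustrated with a minimal sketch. The abstract does not specify how the mean-based variable grid is built, so the code below substitutes a plain equal-width grid and should not be read as the thesis's method. It assumes the commonly used rule in which a point in a cell holding n_i points is kept with probability alpha / n_i^e, where alpha is chosen so that the expected sample size matches the target. The names `grid_cell`, `density_biased_sample`, `bins`, `e`, and `target_size` are illustrative and do not come from the thesis.

```python
import random
from collections import defaultdict


def grid_cell(point, mins, widths, bins):
    """Map a point to its cell index tuple in an equal-width grid."""
    cell = []
    for x, lo, w in zip(point, mins, widths):
        idx = int((x - lo) / w) if w > 0 else 0
        cell.append(min(idx, bins - 1))  # clamp the maximum value into the last bin
    return tuple(cell)


def density_biased_sample(points, target_size, bins=10, e=0.5, seed=0):
    """Return (point, weight) pairs drawn with density-biased probabilities.

    e = 0 reduces to uniform sampling; e = 1 gives every occupied cell the
    same expected number of sampled points, so sparse cells are favoured.
    """
    rng = random.Random(seed)
    dims = len(points[0])
    mins = [min(p[d] for p in points) for d in range(dims)]
    maxs = [max(p[d] for p in points) for d in range(dims)]
    widths = [(hi - lo) / bins for lo, hi in zip(mins, maxs)]

    # Group the points by grid cell (assumption: fixed equal-width grid).
    cells = defaultdict(list)
    for p in points:
        cells[grid_cell(p, mins, widths, bins)].append(p)

    # alpha scales the probabilities so the expected sample size is
    # target_size (before probabilities are capped at 1).
    alpha = target_size / sum(len(c) ** (1 - e) for c in cells.values())

    sample = []
    for members in cells.values():
        prob = min(1.0, alpha / len(members) ** e)
        for p in members:
            if rng.random() < prob:
                # The weight 1 / prob lets later steps correct for the bias.
                sample.append((p, 1.0 / prob))
    return sample
```

With `e` close to 1, a small cluster that uniform sampling would probably miss is retained, which is the cluster-loss problem the abstract describes. The returned weights can be passed to a weighted clustering step to compensate for the over-representation of sparse regions.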
[Degree-granting institution]: Guizhou Minzu University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: C81

[References]