Research on a Density-Biased Sampling Algorithm Based on Big Data and Its Application
[Abstract]: With the rise of big data, data mining has become a major research topic in the field. Big data mining consumes substantial computing and storage resources, so improving the efficiency of processing large-scale data is the key to making it practical. In cluster analysis, there are currently two main ways to improve mining efficiency: improving the classical clustering algorithms, or reducing the size of the original data set through sampling. In the big data setting, data grows much faster than algorithms are improved and updated, so sampling is particularly important for cluster analysis. When traditional sampling techniques are applied to data sets with heavily skewed and unknown distributions, they give poor sampling results, unrepresentative samples, and lost classes. Density-biased sampling can effectively address these problems. This thesis uses density-biased sampling to study unevenly distributed data sets and explores a sampling algorithm suited to such data.
Recent research on density-biased sampling has focused on how to partition a grid space that matches the data set, based on the information characteristics of the original data. Because constructing a variable grid is time-consuming, this thesis improves the existing variable grid partitioning method. First, the granularity of each dimension is determined dynamically from the mean information of that dimension in the original data set. Second, interval density similarity is used to adjust the intervals, producing a variable grid space consistent with the distribution of the original data set. Finally, this grid space is combined with density-biased sampling to design an optimized density-biased sampling algorithm based on mean information. Verification and analysis of the algorithm show that it avoids class loss, improves sample quality, shortens sampling time, and has some advantage in execution efficiency.
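As a rough illustration of the general idea, the sketch below performs density-biased sampling on a fixed uniform grid. It is not the thesis's algorithm: the abstract does not specify how the mean information sets the granularity or how interval density similarity merges intervals, so the variable grid is replaced here by equal-width bins. The function name `density_biased_sample` and the parameters `bins_per_dim`, `target_size`, and `e` are assumptions made for this example. Each point in a cell holding n points is kept with probability alpha / n^e, so e = 0 reduces to uniform sampling and larger e favours sparse cells.

```python
import numpy as np

def density_biased_sample(points, bins_per_dim, target_size, e=0.5, seed=0):
    """Illustrative grid-based density-biased sampling (fixed grid, not the
    thesis's variable grid). Returns the sampled points and the per-point
    inclusion probabilities."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Scale each dimension to [0, 1] and assign an integer bin per dimension.
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    idx = np.floor((points - lo) / span * bins_per_dim).astype(int)
    idx = np.minimum(idx, bins_per_dim - 1)  # put the maximum in the last bin

    # Collapse the per-dimension bins into one cell id per point and count
    # how many points fall in each occupied cell.
    _, cell_id, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    cell_id = cell_id.reshape(-1)
    counts = counts.astype(float)

    # Choose alpha so the expected sample size equals target_size
    # (as long as no probability has to be clipped to 1).
    alpha = target_size / np.sum(counts ** (1.0 - e))
    prob = np.minimum(1.0, alpha / counts[cell_id] ** e)

    keep = rng.random(len(points)) < prob
    return points[keep], prob
```

With one dense cluster and one small cluster, setting `e=1.0` gives points in the small cluster a much higher inclusion probability than points in the dense one, which is how this family of methods tries to avoid losing small classes.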
[Degree-granting institution]: Guizhou Minzu University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: C81