A Randomized Parallel Association Analysis Algorithm Based on the PAC Model
Published: 2018-11-26 14:37
[Abstract]: With the advent of the big data era and the rapid growth of dataset sizes, frequent pattern mining based on parallel/distributed computing has clear advantages over traditional memory- and node-constrained techniques when processing massive datasets. Against this background, this thesis proposes an efficient association analysis algorithm based on random sampling for discovering approximate frequent itemsets in large datasets. The core idea is to mine approximate frequent itemsets from a random sample of the dataset. The standard Apriori and FP-Growth algorithms typically require multiple scans of the entire dataset to find the exact frequent itemsets, which incurs considerable storage and computation costs on very large datasets.

To mine big datasets more efficiently, this thesis proposes two new random-sampling techniques that extract high-quality approximate frequent itemsets from a sample of a transactional dataset and guarantee, with high probability, that the approximate collection is a superset of the dataset's true frequent itemsets. The first method applies the empirical Rademacher complexity from computational learning theory, combined with concentration inequalities, to derive an approximate upper bound on the empirical Rademacher complexity of a dataset sample; results from statistical learning theory then yield an approximate upper bound on the supremum of the absolute frequency error over all itemsets. The second method directly applies a variant of the classical Bernstein inequality to bound the absolute frequency error of an itemset. The PAC learning framework is then applied to the analysis of both sampling methods: via the (ε, δ)-approximation of the frequent itemsets, a sample-size bound satisfying PAC sample completeness is derived for each method, and the algorithm draws a sample of that size from the dataset to mine approximate frequent itemsets.

Since the frequent itemset discovery problem was first posed, a substantial literature has studied sampling-based discovery methods; meanwhile, as single-machine serial computing has hit its limits, an increasing amount of work has targeted frequent itemset discovery on parallel/distributed computing platforms. The extended experiments in this thesis use both the real-world retail dataset and the synthetic dataset T10I4D100K. The correctness of the methods is evaluated on three measures of the approximate frequent itemsets returned from the sample-mining experiments: recall, precision, and frequency-error estimation. Their running time is evaluated in both serial (single-machine) and parallel (Hadoop pseudo-distributed) settings, and finally the performance and sample-size requirements of the proposed theoretical methods are compared against each other.

To the best of our knowledge, this work is the first to apply the empirical Rademacher complexity and the Bernstein inequality to sampling-based frequent itemset discovery, and it uses results from statistical learning to derive relatively tight theoretical sample bounds. These bounds are tighter than the existing VC-dimension-based sample bound for frequent itemset discovery due to Riondato and Upfal, and therefore yield better execution efficiency. Moreover, the sample complexity suggested by our methods is a polynomial function of 1/ε (the inverse of the approximation error) and 1/δ (the inverse of the confidence parameter), so the problem is PAC-learnable. We use classical concepts and techniques from computational learning theory to solve a practical problem; this line of research may also apply to other areas of data mining.
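The Bernstein-based approach described above can be illustrated with a minimal sketch. This is not the thesis's actual derivation: the worst-case variance bound of 1/4, the crude union bound over a hypothesized family of `n_candidates` itemsets, and the naive counting routine below are illustrative assumptions (the thesis derives tighter bounds, e.g. via empirical Rademacher complexity). The idea is that Bernstein's inequality plus a union bound gives a sample size at which every candidate itemset's sample frequency is within ε/2 of its true frequency with probability at least 1−δ; mining the sample at the lowered threshold θ−ε/2 then returns, with high probability, a superset of the true θ-frequent itemsets.

```python
import math
import random
from collections import Counter
from itertools import combinations

def bernstein_sample_size(eps, delta, n_candidates):
    """Sample size so that every one of n_candidates itemsets has sample
    frequency within eps/2 of its true frequency with prob. >= 1 - delta.
    Bernstein: P(|f_S - f_D| > t) <= 2 exp(-n t^2 / (2 var + 2 t / 3)),
    union-bounded over n_candidates itemsets, with var <= 1/4."""
    t = eps / 2.0
    var = 0.25  # worst-case variance of a Bernoulli indicator
    return math.ceil((2 * var + 2 * t / 3) / t ** 2
                     * math.log(2 * n_candidates / delta))

def mine_sample(transactions, theta, max_len=2):
    """Naive frequency counting on a (small) sample, up to max_len-itemsets."""
    n = len(transactions)
    counts = Counter()
    for trans in transactions:
        items = sorted(set(trans))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {iset for iset, c in counts.items() if c / n >= theta}

def approx_frequent_itemsets(dataset, theta, eps=0.1, delta=0.05,
                             n_candidates=1000):
    n = bernstein_sample_size(eps, delta, n_candidates)
    sample = [random.choice(dataset) for _ in range(n)]  # with replacement
    # Lowered threshold theta - eps/2: true theta-frequent itemsets survive
    # w.h.p., so the result is a superset of the true frequent itemsets.
    return mine_sample(sample, theta - eps / 2.0)
```

Note that for fixed ε and δ the sample size does not depend on the dataset size, which is what makes mining a sample attractive for massive datasets; a tighter complexity measure in place of the union bound (as in the thesis) shrinks the sample further.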
[Degree-granting institution]: 云南财经大学 (Yunnan University of Finance and Economics)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13
Article ID: 2358832
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2358832.html