Fast Construction of Decision Trees on High-Dimensional Data

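Published: 2018-04-26 02:05

Topic: high-dimensional data + decision tree; Reference: master's thesis, University of Science and Technology of China (中国科学技术大学), 2017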

[Abstract]: Mining knowledge and information from data has become an important means of solving many practical problems, and the decision tree is one of the most commonly used data mining algorithms. Existing decision tree algorithms, however, require heavy computation and consume substantial resources when processing high-dimensional data. This thesis studies fast decision tree construction methods for high-dimensional data. First, to reduce the computation needed to build a decision tree, we propose a heuristic construction algorithm based on the degree of confusion. The algorithm uses the results already computed at a parent node to estimate upper bounds for some of its child nodes, which cuts the computation needed to find the optimal solution at each child node. Experimental results show that, for both single decision trees and ensembles of decision trees, the algorithm has no negative effect on model accuracy or concept simplicity, and that it reduces computation by about 70% in high-dimensional settings where the data dimension exceeds 1000. Second, to optimize resource usage and disk load during construction, we propose a parallel decision tree construction method based on horizontal and vertical partitioning. Compared with the traditional method, the cluster-wide memory footprint drops from O(T) to O(√T), where T is the number of parallel processes. Correspondingly, the memory footprint of a single parallel process drops from O(1) to O(1/√T), so enlarging the cluster and increasing the degree of parallelism lowers the memory used by each process. Mathematical analysis and experimental results show that the method has no negative effect on network traffic, disk reads and writes, or computation, and that it achieves good parallel efficiency on clusters of different sizes.
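The memory figures quoted in the abstract can be checked against each other with a little arithmetic: cluster memory is the per-process memory multiplied by the number of processes T. The sketch below only illustrates that scaling relationship. The function name `memory_footprint` and the `unit` parameter are hypothetical, all constant factors are assumed to be 1, and nothing here reproduces the thesis's actual partitioning scheme.

```python
import math


def memory_footprint(num_processes: int, unit: float = 1.0):
    """Return ((per_process, cluster), (per_process, cluster)) memory for the
    traditional scheme and the horizontal/vertical partitioning scheme.

    Assumption: constants hidden by the O(.) notation are set to `unit`.
    """
    t = num_processes
    # Traditional: O(1) per process, hence O(T) across the cluster.
    traditional = (unit, unit * t)
    # Partitioned: O(1/sqrt(T)) per process, hence O(sqrt(T)) across the cluster.
    partitioned = (unit / math.sqrt(t), unit * math.sqrt(t))
    return traditional, partitioned


if __name__ == "__main__":
    for t in (4, 16, 64):
        print(t, memory_footprint(t))
```

Under these assumptions, going from 4 to 64 processes multiplies the traditional cluster footprint by 16 but the partitioned one only by 4, while the partitioned per-process footprint shrinks by a factor of 4.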
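[Degree-granting institution]: University of Science and Technology of China (中国科学技术大学)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13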
