Research on Decision Tree Classification Methods for Discrete Attributes

Published: 2018-05-21 05:16

Topics: data mining + decision tree; source: Dalian Maritime University, 2017 master's thesis

[Abstract]: Data mining is the process of discovering patterns in large volumes of existing data. In recent years, the intelligent extraction of knowledge from large data sets has attracted wide attention in industry. The field covers a range of mining methods, including classification, clustering, and association analysis. Decision tree algorithms hold an irreplaceable place in data mining because they extract knowledge simply and efficiently and their results are easy to understand. Most existing decision tree algorithms compute the node-splitting criterion from Shannon's information entropy, which requires repeated logarithm computations, so classification efficiency is low. In addition, existing algorithms choose among candidate nodes at random, so the classifier cannot discriminate further when several attributes have the same splitting-criterion value, which lowers prediction accuracy. To address these shortcomings, this thesis proposes the following improvements:
(1) To address the low classification efficiency of existing decision tree algorithms, and to avoid complex logarithm computations and improve CPU utilization, an improved optimization function for the attribute-selection criterion is proposed. Comparative experiments show that this function effectively improves classification efficiency and CPU utilization.
(2) To address the low precision of the generated decision tree classifier, a heap-based attribute-selection method is introduced, so that a node is no longer chosen at random as the next split when the criterion values of two or more attributes are equal or close to some threshold. Experiments show that the method effectively improves classification precision on certain specific data sets.
(3) To further address low classification precision and overfitting, a method based on classification rules is introduced. The improved decision tree algorithm is run on N random samples to generate N decision tree classifiers, the best classification rules are selected from these classifiers, and the final decision tree model is built from them. Experiments show that, compared with existing algorithms, the proposed algorithm improves both classification efficiency and classification accuracy.
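The abstract does not give the optimization function or the details of the heap-based method, so the following is only a minimal sketch of the two ideas under stated assumptions. It uses Gini impurity as a well-known log-free stand-in for an entropy-based criterion (it is not the thesis's own function), and it uses a min-heap with a fixed secondary key so that attributes with equal scores are ordered deterministically instead of at random. The function names, the tie-break key (number of distinct values, then attribute name), and the toy data are all hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy: one logarithm per class present in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: log-free stand-in, NOT the thesis's optimization function."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_score(rows, labels, attr, impurity):
    """Weighted impurity of the partitions induced by one discrete attribute."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    return sum(len(g) / n * impurity(g) for g in groups.values())


def rank_attributes(rows, labels, attrs, impurity=gini):
    """Return attributes from best to worst split.

    Scores are rounded so near-equal values compare as ties; ties are then
    resolved by an assumed secondary key (fewer distinct values first, then
    attribute name) rather than by random choice.
    """
    heap = []
    for a in attrs:
        score = round(split_score(rows, labels, a, impurity), 9)
        distinct = len({row[a] for row in rows})
        heapq.heappush(heap, (score, distinct, a))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes"]

print(rank_attributes(rows, labels, ["windy", "outlook"]))
print(rank_attributes(rows, labels, ["windy", "outlook"], impurity=entropy))
```

Both criteria rank `outlook` first on this toy data; the difference is that `gini` needs only multiplications and additions, which is the kind of saving the abstract attributes to avoiding logarithms. Whether a given log-free criterion matches entropy's choices on real data sets is an empirical question that this sketch does not settle.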
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13
