A Comparative Analysis of Classification Algorithms in Data Mining

Published: 2018-04-08 11:01

Topic: data mining. Focus: classification. Source: Tianjin University of Finance and Economics, master's thesis, 2016

[Abstract]: The rapid growth and wide availability of data have brought us into a true data age. How to extract valuable information from massive data has become a focus of attention, particularly research on classification techniques in data mining. From business to engineering practice to biomedicine, any problem that involves separating target variables such as regions, commodities, or groups of people according to their attributes can be addressed with classification. Classification algorithms are numerous; commonly used ones include naive Bayes, decision trees, support vector machines, and ensemble learning. However, no single algorithm suits every practical problem, and each has its own characteristics. People are no longer satisfied with merely using classification to analyze and model data sets in support of better decisions; they also want to solve classification problems more efficiently and so create more value. It has therefore become an important need to identify the conditions under which different classification algorithms are applicable, together with their strengths and weaknesses, and even to select classification models automatically. Few scholars in China, however, have compared existing algorithms in application. Abroad, Michie et al. compared three families of classification techniques (neural networks, statistical classification, and machine learning) and applied them to practical industrial problems. This thesis makes a more specific comparison of three classification algorithms: naive Bayes, the C5.0 decision tree, and the support vector machine. After introducing the principles of these algorithms and the criteria for comparing classification results, it selects four representative experimental cases from the social, commercial, biological, and economic domains that differ in number of instances, number of missing values, number of predictive attributes, and number of target classes. A classification model is built for each case with each of the three algorithms, and the algorithms are then compared and analyzed in terms of classification accuracy, stability, interpretability of results, classifier running speed, and effectiveness on data sets with missing values, yielding the strengths and weaknesses of each algorithm on data sets with different characteristics. The experimental results show that the support vector machine has a clear advantage over the other two algorithms in terms of dependence on historical data, classification accuracy, and stability. The decision tree has a clear advantage over the other two in running speed and interpretability of results. Naive Bayes performs better than the other two on data sets containing missing values. Therefore, when relatively few samples are available in a practical problem, the support vector machine works best; for massive data, the decision tree is the most efficient; and when the collected data set contains many missing values, naive Bayes is the better choice.
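The abstract does not say how the thesis handled missing values. As a hedged illustration of why naive Bayes tends to tolerate them, the sketch below shows one common approach: because each attribute contributes an independent term to the class score, a missing attribute can simply be left out of both training and prediction. The class name `GaussianNaiveBayes`, the Gaussian likelihood, and the toy data are assumptions made for this example and are not taken from the thesis.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Illustrative Gaussian naive Bayes that skips missing (None) values."""

    def fit(self, rows, labels):
        # Collect the observed (non-missing) values per class and per feature.
        values = defaultdict(lambda: defaultdict(list))
        counts = defaultdict(int)
        for row, label in zip(rows, labels):
            counts[label] += 1
            for j, x in enumerate(row):
                if x is not None:
                    values[label][j].append(x)

        total = len(labels)
        self.priors = {c: n / total for c, n in counts.items()}

        # Per-class, per-feature mean and variance from observed values only.
        self.stats = {}
        for c, feats in values.items():
            self.stats[c] = {}
            for j, xs in feats.items():
                mean = sum(xs) / len(xs)
                var = sum((x - mean) ** 2 for x in xs) / len(xs)
                self.stats[c][j] = (mean, var + 1e-9)  # floor avoids division by zero
        return self

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for c, prior in self.priors.items():
            score = math.log(prior)
            class_stats = self.stats.get(c, {})
            for j, x in enumerate(row):
                # A missing value, or a feature never observed for this class,
                # contributes no term to the log-score.
                if x is None or j not in class_stats:
                    continue
                mean, var = class_stats[j]
                score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy data (hypothetical): two numeric attributes, some entries missing.
rows = [[1.0, 2.0], [1.2, None], [0.8, 1.9], [3.0, 0.5], [None, 0.4], [3.2, 0.6]]
labels = ["a", "a", "a", "b", "b", "b"]

model = GaussianNaiveBayes().fit(rows, labels)
print(model.predict([1.1, None]))  # second attribute missing
print(model.predict([None, 0.5]))  # first attribute missing
```

A decision tree or support vector machine would usually need imputation or a surrogate-split mechanism before it could score such rows, which is one plausible reason for the difference the abstract reports.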
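[Degree-granting institution]: Tianjin University of Finance and Economics
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13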
