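Research on the Parallelization of Decision Tree Algorithms on the Hadoop Platform
Published: 2018-04-26 07:03
Topics: cloud computing + Hadoop; Source: East China Normal University, master's thesis, 2012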
[Abstract]: The concept of cloud computing was first put forward by Google's chief executive at a search engine conference in 2006. In the six years since, the concept has spread widely, and well-known companies such as Microsoft, Google, and IBM have each launched research on cloud computing. The emergence of more and more cloud computing platforms has also made scalable, inexpensive, and efficient computing models attainable. Information in modern society is growing rapidly, and more than one third of digital information is expected to reside on cloud computing platforms or be processed with their help. With this comes demand from all sectors of society for diversified data mining services, and efficient, trustworthy mining of massive data on cloud computing platforms has become a challenging problem. This thesis first surveys cloud computing platforms such as those of Google, IBM, and Hadoop, focusing on the key technologies of the Hadoop platform: the MapReduce programming model and the Hadoop Distributed File System. It then studies decision tree classification algorithms in some depth and analyzes several commonly used ones. On this basis, for two typical decision tree classification algorithms, C4.5 and SPRINT, it proposes improved methods and parallelization strategies on the Hadoop platform. Experimental results show that, on massive data, both improved algorithms achieve a high speedup on the Hadoop platform, which to some extent addresses the heavy computation and long tree-construction time that C4.5 and SPRINT incur when processing massive data.
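The abstract does not spell out the parallelization strategy, so the following is only a minimal sketch of one common way a C4.5-style split evaluation can be phrased in map/reduce terms, not the thesis's method. It assumes categorical attributes and a class label in the last column, and it runs in a single Python process rather than on Hadoop. All function names (`map_record`, `reduce_counts`, `gain_ratios`) are hypothetical. The map step emits one count per (attribute, value, class) triple, the reduce step sums those counts, and the gain ratio of each attribute is then computed from the aggregated counts alone.

```python
import math
from collections import Counter, defaultdict


def map_record(record, class_index=-1):
    """Map step (illustrative): emit ((attribute, value, label), 1) per attribute."""
    label = record[class_index]
    class_pos = class_index % len(record)
    for attr, value in enumerate(record):
        if attr != class_pos:
            yield (attr, value, label), 1


def reduce_counts(pairs):
    """Reduce step (illustrative): sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def entropy(counts):
    """Shannon entropy (base 2) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_ratios(totals):
    """Compute the C4.5 gain ratio of every attribute from aggregated counts."""
    # attribute -> attribute value -> Counter of class labels
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), count in totals.items():
        per_attr[attr][value][label] += count

    ratios = {}
    for attr, by_value in per_attr.items():
        class_totals = Counter()
        for label_counts in by_value.values():
            class_totals.update(label_counts)
        n = sum(class_totals.values())

        partitions = list(by_value.values())
        sizes = [sum(p.values()) for p in partitions]
        base = entropy(class_totals.values())
        conditional = sum(s / n * entropy(p.values())
                          for s, p in zip(sizes, partitions))
        split_info = entropy(sizes)
        gain = base - conditional
        # An attribute with a single value has zero split information.
        ratios[attr] = gain / split_info if split_info > 0 else 0.0
    return ratios
```

In this sketch only the small table of counts has to reach the node that picks the split, which is the usual reason counting is a natural fit for MapReduce. On an actual cluster the map and reduce functions would be supplied to the framework instead of being called directly.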
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP311.13
Article ID: 1804982
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1804982.html