Research and Application of Parallelized Machine Learning Algorithms on a Cloud Platform
Published: 2018-03-07 11:34
Topic: cloud computing  Focus: Spark  Source: Inner Mongolia Normal University, master's thesis, 2016  Document type: degree thesis
[Abstract]: With the arrival of the information age, data has become the most valuable resource. The data that every industry must process is growing exponentially, including the commercial data of e-commerce websites, the business data of banks, and the genomic data of organisms. Existing platforms struggle to process this explosive growth effectively. At present, the Hadoop platform is a relatively efficient parallel technology for mining useful information from big data. It uses the MapReduce (MR) programming framework, and the larger the data volume, the more clearly its advantages show. Mahout is an open-source machine learning (ML) algorithm library from the Apache community; built on the MR computing framework of the Hadoop platform, it provides developers with efficient algorithm implementations. However, machine learning algorithms are mostly iterative, and MR stores intermediate data on the distributed file system (HDFS), which incurs a high I/O cost. The Spark computing framework emerged in response to this shortcoming of the Mahout machine learning library. Spark is built mainly on the resilient distributed dataset (RDD), an abstraction of distributed memory that reduces I/O resource consumption and the overhead of fault tolerance. Spark can also be deployed on the Hadoop YARN platform, with data stored in a distributed manner. The arrival of Spark MLlib brought a qualitative improvement to research on parallelizing machine learning algorithms. This thesis studies the K-means clustering algorithm and the decision tree classification algorithm, together with its ensemble form, the random forest, all based on Spark MLlib, in order to handle genomic data that a single machine cannot process. K-means serves as the first step of data processing and is used to find the best number of classes. In the second step, the random forest classifier trains a model on the existing classes for subsequent class prediction. The algorithms are applied mainly to the analysis of genomic data but are not limited to it: machine learning algorithms based on a cloud platform and Spark scale well. Experiments show that Spark-based machine learning algorithms can effectively improve the analysis of large-scale genomic data and thus play a positive role in advancing scientific research on genomic data.
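The abstract's point about iterative computation can be illustrated with a minimal single-machine sketch of K-means written as repeated map and reduce passes. This is plain Python, not Spark MLlib and not the thesis's code; the function names (`sq_dist`, `nearest`, `kmeans`, `cost`), the toy points, and the choice of the first k points as initial centers are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Map step: index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points, init_centers, max_iter=20):
    """Lloyd's algorithm expressed as one map pass and one reduce pass per iteration."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        # Map: tag every point with the index of its nearest center.
        mapped = [(nearest(p, centers), p) for p in points]
        # Reduce: accumulate coordinate sums and point counts per center index.
        sums, counts = {}, {}
        for idx, p in mapped:
            if idx in sums:
                sums[idx] = tuple(a + b for a, b in zip(sums[idx], p))
                counts[idx] += 1
            else:
                sums[idx] = tuple(p)
                counts[idx] = 1
        # New center = mean of its points; a center with no points stays put.
        new_centers = [
            tuple(s / counts[i] for s in sums[i]) if i in sums else centers[i]
            for i in range(len(centers))
        ]
        if new_centers == centers:  # assignments no longer change anything
            break
        centers = new_centers
    return centers


def cost(points, centers):
    """Within-cluster sum of squared distances, used here to compare values of k."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data (hypothetical): two visually separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
for k in (1, 2, 3):
    centers = kmeans(points, points[:k])  # assumption: first k points as seeds
    print(k, round(cost(points, centers), 3))
```

Every pass through the loop rereads the full data set and produces intermediate sums. In an MR job those intermediates would be written to HDFS between iterations, which is the I/O cost the abstract describes, whereas an in-memory abstraction such as the RDD keeps them available across iterations. Comparing `cost` across several values of k is one common heuristic for picking the number of classes; the abstract does not say which criterion the thesis uses.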
[Degree-granting institution]: Inner Mongolia Normal University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13;TP181
Article number: 1579113
Article link: https://www.wllwen.com/jingjilunwen/dianzishangwulunwen/1579113.html