Distributed Storage and Algorithmic Analysis of Massive Futures Data Based on Hadoop
Published: 2018-01-20 04:55
Keywords: Hadoop; futures; massive data; storage; data mining; distributed. Source: Tianjin University, 2012 master's thesis. Document type: degree thesis.
【Abstract】: As an important instrument for investment and hedging, futures trading has developed rapidly in recent years, and the data it generates is growing daily, making the integration and exploitation of futures data resources increasingly important. Tools such as data mining and statistics can extract valuable information from these data, and the traditional data mining model can do this; as data volumes keep rising, however, several factors begin to constrain that model. The first is storage: for terabytes or petabytes of data, conventional single-machine commercial storage no longer suffices. The second is analysis: on data of this scale, the running time of traditional single-machine mining algorithms becomes intolerable.
This thesis proposes a solution for the massive data of the futures industry that uses a cluster of commodity computers for distributed storage and parallel data mining. The solution is built on Hadoop, the open-source distributed computing framework originated by Doug Cutting. Hadoop is implemented in Java and rests on HDFS and MapReduce; distributed applications built on it exhibit strong scalability and fault tolerance. The solution comprises two parts, an overall design and a concrete implementation. First, we propose an architecture for massive-data storage and mining that adopts the well-known layered model from software architecture, which gives the application strong flexibility and extensibility. We then provide a simple implementation of each layer, in four parts: a web front end, a Web service control layer, data mining plug-ins, and HBase storage; the development of the data mining plug-in is described in the most detail.
In the implementation, parameters are submitted from the page via Web services and Ajax, which saves network bandwidth while eliminating heterogeneity. On the back end, services are started through Spring's IoC container, which reduces code intrusiveness and cleanly manages the dependencies between services. For the data mining plug-in we implement the Parallel FP-Growth algorithm and use Maven for plug-in development, making the application more manageable and reusable. For storage we use HBase, a column-oriented distributed database with strong advantages for massive data.
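The abstract mentions implementing Parallel FP-Growth (PFP) on MapReduce. Its first pass is a distributed frequency count: each mapper emits an (item, 1) pair per transaction item, and reducers sum the counts into the global frequency list (F-list) that PFP later uses to partition items across groups. A minimal sketch of that counting logic in plain Java follows, without a Hadoop cluster; the class and method names are illustrative, not the thesis's actual code.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the counting pass of Parallel FP-Growth:
// map() emits one (item, 1) pair per item occurrence in a transaction,
// reduce() sums counts per item to form the global F-list.
public class PfpCountSketch {
    // "Map" phase: one (item, 1) pair per item in the transaction.
    static List<Map.Entry<String, Integer>> map(List<String> transaction) {
        return transaction.stream()
                .map(item -> Map.entry(item, 1))
                .collect(Collectors.toList());
    }

    // "Reduce" phase: sum the emitted counts per item.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run map over every transaction, then reduce the combined pairs.
    public static Map<String, Integer> fList(List<List<String>> transactions) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (List<String> t : transactions) pairs.addAll(map(t));
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<List<String>> db = List.of(
                List.of("copper", "gold", "oil"),
                List.of("copper", "oil"),
                List.of("gold"));
        System.out.println(fList(db));
    }
}
```

In the real PFP job this pass runs as a MapReduce job over HDFS; the subsequent passes shard the transactions by item group and run local FP-Growth on each shard.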
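The abstract credits HBase's column-oriented storage for handling massive tick data, but says nothing about the table schema. A key design decision in any HBase deployment is the row key, since rows are stored in lexicographic key order. One common pattern for time-series data like futures ticks (an assumption here, not the thesis's documented schema) is contract ID plus a fixed-width reversed timestamp, so the newest tick of each contract sorts first in its key range:

```java
// Hypothetical HBase row-key scheme for futures tick data.
// Key = contractId + '#' + zero-padded (Long.MAX_VALUE - timestamp),
// so a scan over a contract's prefix returns newest ticks first.
public class RowKeySketch {
    static String rowKey(String contract, long epochMillis) {
        long reversed = Long.MAX_VALUE - epochMillis; // newest-first ordering
        // 19 digits covers the full range of a positive long.
        return String.format("%s#%019d", contract, reversed);
    }

    public static void main(String[] args) {
        String newer = rowKey("CU1206", 2_000L);
        String older = rowKey("CU1206", 1_000L);
        // Lexicographically, the newer tick's key sorts before the older one.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

Prefixing with the contract ID also spreads different contracts across regions, avoiding the hot-spotting that a purely timestamp-leading key would cause.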
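The abstract notes that services are started through Spring's IoC container to reduce code intrusiveness and manage inter-service dependencies. A minimal XML wiring sketch of that idea is below; the bean names and classes are invented for illustration and are not taken from the thesis:

```xml
<!-- Hypothetical Spring bean wiring: the mining service receives its
     HBase-backed DAO via constructor injection, so the service code
     contains no direct construction or lookup calls. -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- DAO layer over HBase storage -->
    <bean id="futuresDao" class="com.example.dao.HbaseFuturesDao"/>

    <!-- Service layer; its dependency is declared here, not in code -->
    <bean id="miningService" class="com.example.service.MiningService">
        <constructor-arg ref="futuresDao"/>
    </bean>
</beans>
```

Because the container owns the object graph, swapping the DAO for a test double or a different storage backend changes only this configuration, which is the low-intrusiveness property the abstract refers to.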
【Degree-granting institution】: Tianjin University
【Degree level】: Master
【Year conferred】: 2012
【CLC number】: TP333
【Cited by】
Related journal articles (1):
1. Liao Fei; Huang Sheng; Gong Dejun; An Le. Research on Hadoop-based distributed storage and mining analysis of urban road traffic flow data [J]. Highways & Automotive Applications, 2013(05).
Related master's theses (1):
1. Du Chaoli. Research on event web page information retrieval driven by spatio-temporal elements [D]. Nanjing Normal University, 2013.
Article ID: 1446932
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1446932.html