Research and Application of Distributed Storage Technology for Massive Data

Published: 2019-01-02 08:44

[Abstract]: In recent years, with the rapid development of information technology, Internet services have kept expanding, user numbers have kept rising, storage space has kept growing, and data volumes have increased at a previously unimaginable rate. Storage capacity and storage performance, however, tend to trade off against each other. Traditional databases struggle with massive data and expose problems such as low concurrency, poor scalability, and low efficiency. Massive data storage has therefore become a major research topic, and parallel distributed databases based on the MPP (Massively Parallel Processing) architecture are one research direction within it.

This thesis is an exploratory study of massive data storage technology. The topic comes from a national key science and technology support project of the 11th Five-Year Plan, "Research on Key Technologies of a Secure and Trusted Carrier-Grade Reproductive Health Service Operation Support System". It mainly addresses the access performance problems caused by the project's growing data volume and provides the project with storage technology that offers high concurrency, high availability, and high scalability.

The research work covers the following aspects:

1. Drawing on the theory of massive data storage, relational and NoSQL data models, distributed database storage, and MPP-based parallel processing, the thesis summarizes existing schemes for massive data storage and the new technologies they apply.

2. It analyzes the characteristics of massive data storage technology, compares the advantages and disadvantages of the distributed massive data storage technologies commonly used in China and abroad, designs a distributed storage model for massive data, and describes in detail how the SQL parsing module, the data sharding module, the parallel query module, and the result module are implemented.

3. Building on the storage model design and the parallel query and storage techniques, the author independently developed "DB Mapping", a storage system based on the MPP architecture, yielding a massive data storage solution with good scalability and the advantages of large-scale parallel processing.

The main contribution of the thesis is a massive data storage method based on MPP-style parallel processing. The method covers the full path from a client issuing a request to the data being persisted, and it incorporates the idea of Map/Reduce by distributing work to the individual data nodes, which provides high scalability, high availability, and high concurrency. The feasibility of the method was verified through simulation tests on a set of distributed data nodes.
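The data sharding, parallel query, and result steps named in the abstract can be illustrated with a minimal Python sketch. This is not the "DB Mapping" implementation, which the abstract does not describe in detail. The number of nodes, the CRC32-modulo sharding rule, the in-memory dictionaries standing in for data nodes, and all function names are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_NODES = 4  # hypothetical number of data nodes

# Each "data node" is just an in-memory dict here (an assumption; a real
# node would be a separate database server).
nodes = [dict() for _ in range(NUM_NODES)]


def node_for(key: str) -> int:
    """Sharding step: map a row key to a node index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def put(key: str, row: dict) -> None:
    """Route a write to the single node that owns the key."""
    nodes[node_for(key)][key] = row


def local_count(node: dict, predicate) -> int:
    """'Map' step: each node counts its own matching rows."""
    return sum(1 for row in node.values() if predicate(row))


def parallel_count(predicate) -> int:
    """Send the same sub-query to every node in parallel, then merge the
    partial counts ('reduce' step) into one result."""
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        partials = list(pool.map(lambda n: local_count(n, predicate), nodes))
    return sum(partials)


# Load 1000 sample rows; ages cycle through 20..69.
for i in range(1000):
    put(f"user-{i}", {"id": i, "age": 20 + i % 50})

total = parallel_count(lambda row: row["age"] >= 60)
print(total)
```

In this sketch every node scans only its own partition, so adding nodes reduces the data each one has to examine. A production system would also need replication and failure handling to deliver the high availability the abstract mentions, which this example does not attempt.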
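[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333; TP311.13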
