Research on Cloud Computing and Its Key Technologies for Massive Data Processing

Published: 2018-06-28 10:56

Topic: massive data processing + cloud computing; Source: Nanjing University of Science and Technology, doctoral dissertation, 2013

[Abstract]: With the rapid development of information technology, data explosion has become a prominent problem in many scientific fields. Massive data provides rich information and broadens people's horizons, but it also brings difficulties in data processing and storage, mainly in the following respects: different information systems contain large numbers of heterogeneous data sources; data lacks a unified, standardized method of organization; in some fields, massive data exists as large numbers of small files that are hard to analyze and process effectively; and massive data must also be stored efficiently. In recent years, the continued maturing of cloud computing technology has offered a new and effective approach to massive data processing. Taking massive data as its research object, this dissertation studies the theory of cloud computing in depth and, drawing on current ideas in the field, makes breakthroughs in several key technologies of cloud computing for massive data processing, establishing an effective set of methods for analyzing and processing massive data. The main contents are as follows:

(1) Building on the respective characteristics of existing cloud platforms, open-source cloud platforms are integrated to process and store massive data, and a new cloud-based model for processing massive small files, C-MSFPM (Cloud computing-Massive Small Files Process Model), is established. Targeting the characteristics of small-file processing, the model handles massive small files through file classification with an improved KNN algorithm based on MapReduce and feature-vector reduction, a file index mechanism, and a file merging algorithm based on the proximity principle and weight similarity.

(2) On top of C-MSFPM, an improved MapReduce model based on XML and multiple values (multi-Value) is constructed for the complex processing and content mapping involved in file queries. The model uses XML to tag data content, coordinates, operation mappings, and other information. For complex processing of massive data and content-mapping queries, XML tagging together with multi-Value processing in the Map phase allows all information related to a piece of data to be retrieved with a single lookup, which greatly improves data processing efficiency. On this basis, content-mapping queries and sorting over massive numbers of small PDF files are compared experimentally across multiple data sets, and the experiments show that the model's algorithms are correct and perform reliably. For cloud-based processing of in-vehicle information data, a resource pool strategy is introduced to solve the problem of packet loss during massive data transmission.

(3) For cloud storage, the coordination mechanism and virtualization are analyzed, the concept of a storage efficiency value for virtual storage nodes is derived from virtual node performance, and cloud storage mechanisms and task scheduling are discussed. A storage task allocation mechanism based on an improved genetic algorithm and a cloud storage data allocation strategy based on improved dynamic programming are proposed. These two algorithms substantially improve storage node utilization and system load balancing.
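The abstract does not spell out how the merging and indexing steps of C-MSFPM work. The sketch below is therefore only a rough illustration of the general idea of packing small files into larger blocks and keeping an index for direct lookup. It is not the dissertation's algorithm. It assumes each file already carries a category label (in the dissertation this would come from the KNN classifier) and uses "same category sits together" as a simple stand-in for the proximity and weight-similarity rule. The names `IndexEntry`, `merge_small_files`, `read_file`, and `block_size` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    block_id: int  # which merged block holds the file
    offset: int    # byte offset of the file inside that block
    length: int    # file size in bytes


def merge_small_files(files, block_size):
    """Greedily pack (name, category, data) records into merged blocks.

    Files are sorted by category (then name) so that files of the same
    class end up next to each other. A new block is started whenever the
    next file would push a non-empty block past block_size.
    Returns (blocks, index); index maps a file name to its IndexEntry.
    """
    blocks = [bytearray()]
    index = {}
    for name, _category, data in sorted(files, key=lambda f: (f[1], f[0])):
        if blocks[-1] and len(blocks[-1]) + len(data) > block_size:
            blocks.append(bytearray())
        index[name] = IndexEntry(len(blocks) - 1, len(blocks[-1]), len(data))
        blocks[-1].extend(data)
    return [bytes(b) for b in blocks], index


def read_file(blocks, index, name):
    """Fetch one original file from the merged blocks via the index."""
    entry = index[name]
    block = blocks[entry.block_id]
    return block[entry.offset:entry.offset + entry.length]


# Hypothetical sample data: three tiny "files" in two categories.
sample = [
    ("a.txt", "report", b"alpha"),
    ("b.txt", "log", b"beta"),
    ("c.txt", "report", b"gamma"),
]
blocks, index = merge_small_files(sample, block_size=10)
print(len(blocks), index["a.txt"], read_file(blocks, index, "a.txt"))
```

Keeping the index separate from the merged blocks means a query touches one index entry and one block rather than scanning many small files, which is the motivation the abstract gives for merging small files in the first place.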
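[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: TP333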
