Research on Storage and Management Technologies for Internet of Things Big Data

Published: 2018-02-15 22:51

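Keywords: Internet of Things; big data; distributed file system; data retrieval; data cube; energy-saving task scheduling　Source: University of Science and Technology of China, 2017 doctoral thesis　Document type: dissertation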

[Abstract]: The Internet of Things (IoT) is a vast network formed by combining massive numbers of sensing devices with the Internet. In the IoT, sensing devices continuously collect data and send it to data centers; as sensing and network technologies advance, the data grows to massive scale and forms IoT big data. Persistently storing IoT big data makes both the historical and the current readings of any sensor available; retrieving and statistically analyzing the data makes it possible to perceive complex events and regularities and to analyze trends; data storage and management run in the data center as stream tasks, and energy-saving task scheduling lowers the cost of IoT applications. Together these bring new opportunities to urban safety, smart cities, target recognition and tracking, location services, and many other fields.

Storing and managing IoT big data requires persisting the data, retrieving it in real time, analyzing and processing it promptly, and providing an efficient computing framework, so that the data can ultimately be sensed and controlled effectively. The massive scale of IoT big data, however, poses great challenges for storage and management. First, "persistent storage": massive numbers of sensors frequently produce new readings and send them to the data center, forming a write stream of several GB per second that strains traditional persistent storage systems such as HDFS. Large-scale distributed file systems typified by HDFS do support big data storage, but they were not designed for real-time, high-performance data storage and therefore cannot meet the growing demand for online big data storage. For example, when HDFS faces a data stream of massive small files, per-node performance often drops to a few MB/s, far below practical requirements. Second, "data retrieval": data stored on persistent devices must be located quickly with the help of a data retrieval system, but current database systems, mainly relational and NoSQL databases, cannot effectively meet the retrieval needs of IoT big data. For example, NoSQL databases design their read/write paths, index structures, query execution, query optimization, and recovery strategies around disk storage, and the inherently poor read/write performance of disks limits improvements in big data storage and especially in big data analysis. Third, "statistical analysis": this requires building data cubes to make statistical analysis efficient. Traditional data cubes such as HIVE, however, can only analyze deterministic data; when faced with the probabilistic data found in the IoT, statistical analysis takes on the order of hours, which cannot meet practical needs. Finally, storage, retrieval, and analysis all run in the data center as stream tasks, and energy accounts for 40% of a data center's operation and maintenance cost, so energy-saving task scheduling becomes the key to reducing data center cost; yet current task scheduling platforms typified by Hadoop YARN do not support energy-saving task scheduling. In summary, many existing data storage and management techniques show limitations when faced with IoT big data.

To address these problems, this thesis proposes "a data storage and management system framework for IoT big data" (Sensor Storage). Sensor Storage is a distributed platform for data storage, retrieval, and analysis, and mainly comprises the following key techniques.

(1) A distributed file system for massive small files. The thesis builds SensorFS, a distributed storage system that extends HDFS; its architecture supports fast storage and query optimization for massive small files and provides high scalability and data security guarantees. The thesis proposes a write-throughput optimization mechanism and algorithm for massive small files, theoretically analyzes and models the small-file write bottleneck, and designs a small-file write optimization strategy. It also proposes a mechanism for optimizing the read performance of massive small files in HDFS.

(2) A space-efficient key-value data retrieval system. The thesis builds RadixKV, a key-value data retrieval system based on the Radix Tree, which provides fast keyword-based retrieval over the massive content held in the distributed file system. It analyzes the strengths and weaknesses of the Radix Tree, analyzes its online update performance, and designs an adaptive parallel index update strategy. It further proposes Radix Array, a representation of the Radix Tree with optimized space overhead, designs the Radix Array data structure, and analyzes its space overhead.

(3) A data cube system for probabilistic data. The thesis analyzes the "uncertainty" characteristic of IoT big data and, on that basis, designs ProbabilisticCube, a data cube system for probabilistic data that provides fast aggregate queries over such data. It defines a probabilistic data model for IoT big data and, based on that model, defines and designs the probabilistic data cube; designs high-performance aggregation operations over probabilistic data; designs a cube materialization strategy based on a materialization cost estimation model; and designs slice and dice queries for probabilistic data.

(4) An energy-efficient task scheduling framework. The thesis builds Green Yarn, a distributed task scheduling framework that extends Hadoop YARN. The new framework schedules IoT stream tasks sensibly and, without sacrificing performance, exploits the dynamic voltage and frequency scaling (DVFS) capability of servers to match tasks to servers. The thesis designs a task-based energy-efficiency model and task scheduling algorithms for offline batch tasks and for online tasks respectively.

Through this systematic study, the thesis aims to establish a new storage architecture for IoT big data and to contribute innovative optimized designs for the file system and for big data retrieval and analysis, solving fundamental problems in these areas. The research provides an initial relief of the storage and management pressure of IoT big data, and it implements a prototype system that supports further validation, experimentation, and application of efficient big data storage and management and offers new ideas for big data management theory and systematic methods.
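
As an illustration of the Radix Tree index that item (2) of the abstract builds on, the following is a minimal sketch of a generic radix tree (a prefix tree whose edges carry whole substrings) with insert and exact-match lookup. It is not the thesis's RadixKV or Radix Array, whose designs the abstract does not detail; the class names, the sensor-style keys, and the block identifiers are all hypothetical.

```python
from typing import Dict, Optional


class RadixNode:
    """One node of a radix tree; each outgoing edge carries a string label."""

    def __init__(self) -> None:
        self.children: Dict[str, "RadixNode"] = {}  # edge label -> child node
        self.value: Optional[str] = None
        self.has_value = False  # distinguishes stored keys from inner nodes


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


class RadixTree:
    """Generic radix tree sketch (hypothetical, not the thesis's RadixKV)."""

    def __init__(self) -> None:
        self.root = RadixNode()

    def put(self, key: str, value: str) -> None:
        node, rest = self.root, key
        while rest:
            for label in list(node.children):
                k = common_prefix_len(label, rest)
                if k == 0:
                    continue
                child = node.children[label]
                if k < len(label):
                    # Split the edge: the shared prefix gets a new inner node.
                    mid = RadixNode()
                    mid.children[label[k:]] = child
                    del node.children[label]
                    node.children[label[:k]] = mid
                    child = mid
                node, rest = child, rest[k:]
                break
            else:
                # No edge shares a prefix with the remainder: add a new leaf.
                leaf = RadixNode()
                node.children[rest] = leaf
                node, rest = leaf, ""
        node.value, node.has_value = value, True

    def get(self, key: str) -> Optional[str]:
        node, rest = self.root, key
        while rest:
            for label, child in node.children.items():
                if rest.startswith(label):
                    node, rest = child, rest[len(label):]
                    break
            else:
                return None  # no edge matches the remaining key
        return node.value if node.has_value else None


# Usage with hypothetical sensor keys mapped to hypothetical storage locations.
index = RadixTree()
index.put("sensor/001/temp", "block-17")
index.put("sensor/001/hum", "block-18")
index.put("sensor/002/temp", "block-42")
print(index.get("sensor/001/hum"))  # block-18
print(index.get("sensor/001"))      # None (a prefix only, not a stored key)
```

Because keys that share a prefix share the edges spelling that prefix, sensor-style keys with long common prefixes are stored compactly, which is the general property that makes this family of structures a natural starting point for a space-efficient index.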

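[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: TP391.44;TN929.5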

Document ID: 1514067


Link to this article: https://www.wllwen.com/shoufeilunwen/xxkjbs/1514067.html
