Research on Key Technologies for Massive Data Storage of Solar Telescopes
Published: 2019-07-05 20:40
[Abstract]: Astronomical data processing has entered the data-intensive era of astroinformatics, in which big data is a defining feature. In solar observation this shows up as very large data volumes, high acquisition rates, and continuous data growth. Traditional local storage such as DAS and network storage such as NAS and SAN show many limitations when faced with the storage, processing, and management needs of astronomical big data, and these limitations hinder many research activities. Modern astronomical observation built on massive data urgently needs advanced big-data processing techniques such as MapReduce to speed up processing. To support such techniques, the storage system must provide high-performance, scalable concurrent reads and writes and be able to manage massive astronomical data.
The 1 m New Vacuum Solar Telescope (NVST) is in operation. It uses a high-speed, multi-channel, multi-terminal data acquisition mode and has so far produced more than 200 TB of solar observation data. Under ideal observing conditions the photosphere and chromosphere channels observe simultaneously, currently acquiring up to 190 GB and 60 GB per hour respectively; over an 8-hour observing day this yields about 2 TB (terabytes) of data. As the NVST high-resolution imaging system demands higher temporal and spatial resolution, and as more channels operate concurrently in the future, the one-way write rate could reach the order of terabytes per second, and it would double if real-time data processing is included. At such rates, single-machine hard disk storage can no longer sustain NVST's continuous high-speed writes. Some current mainstream storage technologies, such as solid-state drives, are limited in solar observation by cost and by their finite number of read/write cycles. All of this severely limits NVST's scientific output.
In addition, traditional storage technologies such as the local file systems Ext3 and Ext4 and the newer ZFS can hardly meet the demand for high-speed concurrent reads and writes in solar observation, and data management based on relational databases does not cope well with NVST's massive data. These problems call for a storage technology that can manage massive data, offers high performance and high scalability, adapts to NVST's changing storage requirements, and supports high-speed data processing. Some advanced technologies, such as storage consolidation based on DAS and SAN and virtualized storage, can meet these needs, but they are technically complex and costly to deploy, configure, manage, and maintain, so they are also unsuitable for solar observation. Distributed parallel storage meets these needs well: it provides high-performance concurrent storage, scales out well, and can be deployed on inexpensive commodity hosts. Considering cost, performance, and scalable management together, distributed storage is a good fit for the massive data produced by NVST's multi-channel, multi-band observing mode. Efficiently retrieving and querying massive observation data is another challenging storage-management problem, which distributed non-relational (NoSQL) data storage and management can address effectively. This thesis therefore takes distributed storage as its core and studies the application of distributed file systems and NoSQL-based retrieval and query of massive data in solar observation. The main work is as follows:
1) Application of distributed file systems in solar observation. Experiments examined, in both horizontal and vertical terms, the storage performance and scalability of distributed file systems and their feasibility for solar observation. Storage performance optimization for FITS files was studied; using bonding in a Gigabit network environment, a single process reached a storage rate of 3.4 Gb/s, which meets NVST's current high-speed storage needs. The work focused on how distributed file systems should be applied in solar observation and how to meet the data storage needs of heterogeneous platforms.
2) Inconsistency between solar FITS metadata and data in distributed storage. In a distributed storage environment, FITS metadata and data are stored separately to allow efficient query and management. Transient network or disk failures can then leave large amounts of metadata and data inconsistent with each other. How to enforce consistency between metadata and data is a problem easily overlooked during high-speed data storage. The thesis analyzes the causes of inconsistency, an inconsistency model, and countermeasures, and proposes applying the two-phase commit protocol to preserve consistency between the two as far as possible.
3) Design of AstroFS, a distributed storage system for solar observation, and of its core components. This includes high-performance features: for example, to suit solar observation, the multi-level directory tree is abandoned in favor of a two-level flat directory for observation files, and a network-based RAID 0 data striping technique is designed. Other key techniques in the system are also analyzed and designed in detail, such as data aggregation and splitting, balanced data distribution, concurrency, and replication.
4) A general scheme for storing unstructured FITS files in NoSQL is described using formal methods, and a compressed word-aligned bitmap index algorithm is used to index massive astronomical data. An astronomical observation data archiving system based on FastBit was designed and implemented, offering efficient indexing and retrieval.
The distributed storage technology for massive solar observation data and the compressed word-aligned bitmap indexing studied in this thesis solve the problems of fast storage and efficient retrieval of NVST observation data and are of strong practical use. The research approach also offers a reference for storage and massive-data retrieval at similar solar telescopes in China and abroad, and has value for wider application.
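Item 2 of the abstract proposes a two-phase commit protocol to keep separately stored FITS metadata and data consistent. The abstract gives no implementation details, so the following is only a minimal in-memory sketch of the general protocol, not the thesis's design. The names `Participant` and `two_phase_write`, the `fail_on_prepare` flag, and the sample file name and header are all hypothetical, and the sketch omits the coordinator log, timeouts, and crash recovery that a real deployment would need.

```python
class Participant:
    """One store (metadata or data) taking part in the commit. Hypothetical."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare  # simulates a transient fault
        self.staged = {}      # txid -> (key, value) awaiting a decision
        self.committed = {}   # key -> value visible to readers

    def prepare(self, txid, key, value):
        # Phase 1: stage the write and vote; False is a "no" vote.
        if self.fail_on_prepare:
            return False
        self.staged[txid] = (key, value)
        return True

    def commit(self, txid):
        # Phase 2 (commit): make the staged write visible.
        key, value = self.staged.pop(txid)
        self.committed[key] = value

    def abort(self, txid):
        # Phase 2 (abort): discard anything staged for this transaction.
        self.staged.pop(txid, None)


def two_phase_write(txid, key, header, payload, meta_store, data_store):
    """Write header and payload together; return True only if both commit."""
    writes = [(meta_store, header), (data_store, payload)]
    votes = [store.prepare(txid, key, value) for store, value in writes]
    if all(votes):
        for store, _ in writes:
            store.commit(txid)
        return True
    # Any "no" vote aborts everywhere, so neither store exposes a partial write.
    for store, _ in writes:
        store.abort(txid)
    return False


if __name__ == "__main__":
    meta = Participant("metadata")
    data = Participant("data")
    # Made-up file name, header, and payload, for illustration only.
    ok = two_phase_write("tx1", "obs_0001.fits", {"NAXIS": 2}, b"\x00" * 16,
                         meta, data)
    print("committed:", ok)
```

The point of the sketch is that a failure in either store during the prepare phase leaves both stores unchanged, which is the property the abstract is after when it speaks of constraining metadata and data to stay consistent.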
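[Degree-granting institution]: Graduate University of Chinese Academy of Sciences (Yunnan Observatories)
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: P111.41
Article ID: 2510800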
Article link: https://www.wllwen.com/kejilunwen/tianwen/2510800.html