Construction and Application of the LAMOST Scientific Computing Cloud Platform System

Published: 2019-01-23 20:04

【Abstract】: With advances in detectors and space technology, astronomical observation has expanded from the visible and radio bands to every band of the electromagnetic spectrum, including infrared, ultraviolet, X-ray, and γ-ray, giving rise to all-wavelength astronomy. The field has now entered a new stage characterized by full wavelength coverage, large samples, and enormous information volume. Astronomy has become one of the leading disciplines in terms of data volume: because astronomical data are both vast and growing rapidly, sky surveys routinely produce data at the TB or even PB scale. The Sloan Digital Sky Survey (SDSS), for example, took ten years to cover 8,000 square degrees of sky and obtained about 40 TB of imaging and spectral data on roughly 10^8 stars, galaxies, and quasars. The LAMOST survey plans to observe the spectra of 10 million galaxies, 1 million quasars, and 10 million stars, and will produce about ten times as much data as SDSS, so storing and processing data on this scale is a major challenge. To meet LAMOST's needs, this thesis builds a scientific computing platform suited to storing and processing massive spectral data, and designs and implements a customizable cloud storage system. The main work is as follows:

1. A scientific computing platform based on the open-source Hadoop framework and suited to astronomical data processing was built on 24 servers at the LAMOST data processing center. It includes commonly used toolkits such as NumPy, SciPy, and PyFITS. An automated deployment package written in Python and Shell makes it quick and easy to add or remove physical nodes and to configure load balancing.

2. Based on HDFS, a core Hadoop component, a multi-user cloud storage system was designed and implemented. It lets users create folders, upload files, download files and folders, delete files and folders, and use a recycle bin, a notepad, and personal information management. The administrator role additionally has account management (creating, modifying, setting quotas for, and deleting accounts), organization management, and system information queries. Users can conveniently store relevant data and processing results on the platform.

3. The MapReduce programming model, a core component of the scientific computing platform, was studied. Building on an existing, relatively mature template matching algorithm, template matching was implemented according to the MapReduce programming model. The revised algorithm was validated by testing on data with KNN and chi-square minimization, and its performance was compared in single-machine and cluster environments.
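The abstract does not give the thesis's implementation, so the following is only a minimal sketch of how template matching by chi-square minimization could be split into a map step and a reduce step. Everything here is an assumption made for illustration: the function names (`chi_square`, `map_phase`, `reduce_phase`, `run_local`), the toy spectra and templates, and the choice to fit a single multiplicative scale factor per template. An in-memory dictionary stands in for Hadoop's shuffle, so the sketch runs on one machine without a cluster.

```python
from collections import defaultdict


def chi_square(flux, template, sigma):
    """Chi-square between a spectrum and a template (hypothetical helper).

    Assumption: the template may be rescaled by one factor `a`, chosen
    analytically to minimise sum(((f - a*t) / s)**2).
    """
    num = sum(f * t / (s * s) for f, t, s in zip(flux, template, sigma))
    den = sum(t * t / (s * s) for t, s in zip(template, sigma))
    a = num / den if den else 0.0
    return sum(((f - a * t) / s) ** 2 for f, t, s in zip(flux, template, sigma))


def map_phase(spectrum_id, flux, sigma, templates):
    """Mapper: emit (spectrum_id, (chi2, template_name)) for every template."""
    for name, template in templates.items():
        yield spectrum_id, (chi_square(flux, template, sigma), name)


def reduce_phase(spectrum_id, values):
    """Reducer: keep the template with the smallest chi-square."""
    chi2, name = min(values)
    return spectrum_id, name, chi2


def run_local(spectra, templates):
    """Run map, group-by-key, and reduce in memory (stand-in for a Hadoop job)."""
    grouped = defaultdict(list)
    for spectrum_id, (flux, sigma) in spectra.items():
        for key, value in map_phase(spectrum_id, flux, sigma, templates):
            grouped[key].append(value)  # plays the role of the shuffle step
    return {key: reduce_phase(key, values)[1:] for key, values in grouped.items()}


# Toy data, invented for this sketch: four flux samples per spectrum.
templates = {
    "star": [1.0, 2.0, 3.0, 2.0],
    "galaxy": [3.0, 1.0, 1.0, 3.0],
}
spectra = {
    "obj1": ([2.1, 3.9, 6.0, 4.1], [0.1, 0.1, 0.1, 0.1]),
    "obj2": ([1.5, 0.5, 0.6, 1.4], [0.1, 0.1, 0.1, 0.1]),
}

if __name__ == "__main__":
    for spectrum_id, (best_template, chi2) in run_local(spectra, templates).items():
        print(spectrum_id, best_template, round(chi2, 2))
```

Because each spectrum is scored against the templates independently, the map step can be spread across nodes without coordination, and the reduce step only has to take a minimum per spectrum. On a real cluster the same two functions could be wrapped as mapper and reducer scripts, with the templates distributed to every node. That packaging is likewise an assumption, not something the abstract specifies.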
【Degree-granting institution】: Shandong University
【Degree level】: Master's
【Year degree conferred】: 2013
【Classification number】: TP333
