Design and Implementation of a Spark-Based Spatial Data Platform System

Published: 2018-01-10 10:17

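Keywords: Design and Implementation of a Spark-Based Spatial Data Platform System　Source: Shandong University, 2017 master's thesis　Document type: degree thesis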

[Abstract]: Spatial data, also known as geographic data, is information about physical objects that can be represented by positions in a geographic coordinate system. Many industries now continuously produce large volumes of spatial data: geographic information from remote-sensing satellite monitoring (rivers, lakes, towns), mobile phone call records in mobile communication networks, positions of GPS-equipped vehicles in urban traffic networks, and location-tagged content generated in social networks. Fully analyzing and exploiting this data will play an important role in fields such as environmental management, communication security, and traffic planning. As more valuable spatial data is produced, the need for tools suited to analyzing and processing spatial data at large scale grows increasingly urgent.

However, current relational database technology and distributed computing systems are not well suited to spatial data. Spatial index structures are hard to express in a relational database, so relational databases process spatial queries inefficiently. Because of shortcomings in the MapReduce programming model, existing distributed analysis frameworks built on HDFS and MapReduce are slow at interactive queries and iterative operations. MapReduce processes data as follows: it first reads data from cluster disks into memory, performs the computation, and then writes the result from memory back to cluster disks as the input of the next computation. The redundant disk reads and writes incurred by every computation cause serious performance problems for MapReduce-based algorithm implementations, which therefore cannot meet users' requirements for real-time analysis of large-scale spatial data.

Apache Spark is an emerging cluster computing framework. Compared with MapReduce, Spark provides in-memory iterative computation: the data being processed can stay resident in memory, which saves disk I/O time. In interactive query settings it is more than 100 times faster than Hadoop, currently the most popular parallel computing tool. As Spark has continued to evolve, researchers have begun extending it to support distributed query processing over spatial data. GeoSpark and SpatialSpark are the most advanced such systems to date. Both extend Spark to provide distributed storage and querying of spatial data, and both share a similar three-layer architecture: a spatial data storage layer, a data index layer, and a query processing layer. The storage layer provides distributed storage of large-scale spatial data. The index layer applies traditional spatial indexing techniques to the distributed spatial data. The query processing layer exposes spatial query interfaces to users and carries out spatial data analysis through the index and storage layers. The supported operations are range queries, spatial join queries, and spatial k-nearest-neighbor queries. However, GeoSpark and SpatialSpark still have a number of design shortcomings that limit their query performance.

In this thesis, we comprehensively improve on this architecture with a new spatial data partitioning strategy, a new index structure, and new query processing techniques, and we design and implement Spark-GIS, a new Spark-based spatial data computing system. Comprehensive experiments show that Spark-GIS achieves higher query performance than the systems above. Its main contributions are threefold:

1. In the spatial data storage layer, we design and implement a new spatial data partitioning strategy. Distributed storage built on this strategy gives better support to the spatial queries in the upper layers and avoids workload imbalance during query processing.
2. In the spatial data index layer, we design and implement an R-tree spatial index structure based on Voronoi diagrams. Compared with a plain R-tree, it greatly reduces both the time needed to build the index and the size of the index, without degrading spatial query performance.
3. In the spatial data analysis layer, we combine the improved distributed storage strategy with the spatial indexing technique to implement Spark-based parallel spatial query algorithms, offering users interactive queries over massive spatial data under high concurrency. The supported queries are spatial range queries, spatial join queries, and spatial k-nearest-neighbor queries.

Finally, we carry out comprehensive comparative tests of Spark-GIS, Spark, and GeoSpark. The test data consists of mobile phone call records numbering in the hundreds of millions. The results show that Spark-GIS outperforms GeoSpark, the most advanced system to date, across all spatial query operations. For spatial range queries and spatial join queries in particular, performance improves on GeoSpark by several orders of magnitude.
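
The three query types named in the abstract (range query, spatial join, and k-nearest-neighbor query) can be illustrated with a minimal single-machine sketch in plain Python. This is not the Spark-GIS implementation: the uniform grid below is only an assumed stand-in for the thesis's partitioning strategy and Voronoi-based R-tree, the join is simplified to a distance join over points, and all function names are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def build_grid(points, cell_size):
    """Bucket 2-D points by uniform grid cell (assumed stand-in for partitioning/indexing)."""
    grid = defaultdict(list)
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell].append((x, y))
    return grid


def range_query(grid, cell_size, xmin, ymin, xmax, ymax):
    """Return points inside an axis-aligned rectangle, visiting only overlapping cells."""
    result = []
    for cx in range(math.floor(xmin / cell_size), math.floor(xmax / cell_size) + 1):
        for cy in range(math.floor(ymin / cell_size), math.floor(ymax / cell_size) + 1):
            for x, y in grid.get((cx, cy), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    result.append((x, y))
    return sorted(result)


def knn_query(points, query, k):
    """Brute-force k nearest neighbours by Euclidean distance (no index used)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


def distance_join(grid_a, points_b, cell_size, radius):
    """Return pairs (a, b) with dist(a, b) <= radius, probing grid_a once per b."""
    pairs = []
    for bx, by in points_b:
        # Filter step: bounding box around b; refine step: exact distance check.
        candidates = range_query(
            grid_a, cell_size, bx - radius, by - radius, bx + radius, by + radius
        )
        for a in candidates:
            if math.dist(a, (bx, by)) <= radius:
                pairs.append((a, (bx, by)))
    return sorted(pairs)


if __name__ == "__main__":
    pts = [(1.0, 1.0), (2.5, 2.5), (4.0, 4.0), (9.0, 9.0)]
    g = build_grid(pts, cell_size=2.0)
    print(range_query(g, 2.0, 0.0, 0.0, 3.0, 3.0))
    print(knn_query(pts, (3.0, 3.0), k=2))
    print(distance_join(g, [(2.0, 2.0)], 2.0, radius=1.5))
```

In a distributed setting, each grid cell would roughly correspond to one partition, so a range query or join only needs to touch the partitions whose cells overlap the query region. That pruning is the general idea the storage and index layers are meant to support.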

[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: P208;TP311.52
