基于MapReduce的空间数据RkNN算法研究
发布时间:2018-04-03 11:33
本文选题:RkNN查询 切入点:MapReduce 出处:《大连理工大学》2013年硕士论文
【摘要】:近几年,随着移动互联网技术和地理信息技术的发展,基于位置服务应用逐渐兴起,从而使得空间定位信息的数据量呈现以指数级增长。而在地理位置信息相关的空间数据查询中,RkNN (Reverse k Nearest Neighbor,反最近邻)查询问题是通过返回所给查询点周围的对象中,以该查询点作为其kNN (k Nearest Neighbor,最近邻)的所有对象的集合,其在多种数据挖掘应用中备受关注。同样传统的RkNN方法也已难以满足迅猛增长的数据速度以及用户们的大规模实时查询要求。 本文将传统的RkNN算法与MapReduce分布式框架相结合,分析如何解决大规模空间数据的分布式RkNN查询问题。MapReduce是2004年由谷歌公司提出的一个用来进行并行处理和生成大数据集的模型,而MapReduce框架作为分布式计算中的典型的离线计算框架,很难实现实时性计算效果。因此,本文采用了离线和在线相结合的系统模型,利用MapReduce框架离线完成倒排网格索引的创建和更新工作,同时结合在线计算方法返回RkNN查询结果。文中首先提出基于大规模空间数据集上的倒排网格索引的暴力RkNN查询算法——Basic-MRkNN算法;接下来提出对此算法的优化算法——延迟Lazy-MRkNN查询算法和增量式Eager-MRkNN查询算法。为了减少网络和磁盘I/O开销,在过滤过程中利用了一些剪枝规则来提高本文提出的分布式算法性能。此外,通过在32个节点的集群上对模拟数据集和真实数据集的大量实验表明,本文提出的基本算法在与泰森多边形分布式RkNN查询算法相比时能提速50%以上。此外,两种优化算法均优于基本算法,Eager-MRkNN算法比较适合处理密集型数据,而Lazy-MRkNN算法比较适合处理稀疏型数据。
[Abstract]:In recent years, with the development of mobile Internet and geographic information technology, the application of location-based services is gradually rising, which makes the amount of spatial location information increase exponentially.In the spatial data query related to geographical location information, the problem of rkNN / inverse Nearest neighbor is to use the query point as the collection of all objects of its kNN Nearest neighbor by returning the object around the given query point.It has attracted much attention in many kinds of data mining applications.The traditional RkNN method is also difficult to meet the rapid growth of data speed and users of large-scale real-time query requirements.This paper combines the traditional RkNN algorithm with the MapReduce distributed framework, and analyzes how to solve the distributed RkNN query problem of large-scale spatial data. MapReduce is a model proposed by Google in 2004 for parallel processing and big data set generation.However, as a typical offline computing framework in distributed computing, MapReduce framework is difficult to achieve real-time computing effect.Therefore, an offline and on-line system model is adopted in this paper. The inverted grid index is created and updated offline using the MapReduce framework, and the RkNN query results are returned with the on-line calculation method.In this paper, we first propose a robust RkNN query algorithm based on inverted grid index on large spatial data sets, Basic-MRkNN algorithm, and then we propose an optimization algorithm for this algorithm, which is the delayed Lazy-MRkNN query algorithm and the incremental Eager-MRkNN query algorithm.In order to reduce the I / O overhead of the network and disk, some pruning rules are used to improve the performance of the distributed algorithm proposed in this paper.In addition, a large number of experiments on simulated data sets and real data sets on 32-node clusters show that the proposed algorithm can increase the speed by more than 50% compared with Tyson polygon distributed RkNN query algorithm.In addition, the two optimization algorithms are better than the basic algorithm, Eager-MRkNN algorithm is more suitable for processing intensive data, while the Lazy-MRkNN algorithm is more suitable for sparse data processing.
【学位授予单位】:大连理工大学
【学位级别】:硕士
【学位授予年份】:2013
【分类号】:TP311.13;TP338.8
【参考文献】
相关期刊论文 前1条
1 过志峰,王宇翔,杨崇俊;空间数据索引与查询技术研究及其应用[J];计算机工程与应用;2002年23期
相关博士学位论文 前1条
1 高云君;时空数据库查询处理关键技术研究[D];浙江大学;2008年
,本文编号:1705031
本文链接:https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1705031.html