分布式环境下Top-K计算问题研究
发布时间:2018-11-09 08:57
【摘要】:Top-k计算作为一种偏好查询,是数据库中一个最基本的操作,旨在从给定的数据集中查找出用户可能感兴趣的信息。作为一种数据分析的重要工具,top-k计算在网页搜索、电子商务、数据挖掘、多标准决策支持等领域有着广泛的应用。随着大数据时代的来临,传统的top-k处理技术遇到前所未有的挑战,已经无法满足大数据分析的需求。新环境下的top-k计算主要面临着三个挑战:一是数据规模达到TB或者PB级,传统的单机处理方式不再适用,应该考虑分布式并行计算框架;二是对于面对海量数据集,在分布式环境下,采取怎样的数据划分方法才能够提升并行性能和查询速度;三是传统的top-k查询需要用户给定一个评分函数,而选择一个合适的评分函数却不是件容易的事。因此,本文对分布式环境下top-k计算的数据划分和并行算法设计的关键技术进行了研究和探索,主要的研究内容包括:(1)在分布式计算框架下,针对于加权top-k查询问题,提出了类似网格数据划分方式,将原始数据集划分为不同的子数据集,根据用户偏好选取子数据集代替全部数据集进行查询,减少查询数据。针对于高维度中的“空空间”现象,本文在网格划分基础上引入超平面划分。与基于角度和超平面的数据划分方式相比,该方法预处理简单不用进行复杂坐标转换,而且对于较高维度中出现的“空空间”现象依旧适用。实验结果证明在大数据环境下类似网格和超平面数据划分方法查询速度比基于角度划分方法快了接近15‰此外对于数据维度较高时候出现的“空空间”现象(实验中即:d大于等于8),比基于角度划分方法,查询结果更准确,同时具有良好的可扩展性。(2)针对于传统的top-k查询需要用户给定一个评分函数,而某些用户难以给出一个合理的评分函数这一问题。在结合已有的单机算法基础上提出了五种在分布式平台下基于度量空间的并行top-k dominating查询算法。算法1利用skyline集合中一定包含top-1 dominating结果这一结论,分区并行计算skyline,来加快处理速度;同时利用候选集的支配关系,避免k次重复计算。算法2利用k-skyband集合中包含所有top-k dominating结果这一结论,每个分区并行计算k-skyband,避免k次循环。算法3在算法1基础上,首先结合ANN对原始数据进行筛选,加快对skyline的计算。算法4为一种基于集合ANN和k-skyband的剪枝算法,该算法利用集合ANN预先剪枝,再求k-skyband,最后获取top-k dominating,加快计算k-skyband速度。算法5为一种基于排序剪枝的top-k dominating算法,该算法根据查询输入集合Q对数据集排序,建立索引表,采用round-robin方式读取索引表,避免遍历原始数据集来计算每个候选集的支配分数。实验结果表明这五种并行算法减少了数据之间的支配比较次数,提高了查询效率,效果明显,且大部分情况下算法4的查询效果最好。
[Abstract]:As a kind of preference query, Top-k computing is the most basic operation in the database, which aims to find out the information that the user may be interested in from the given data set. As an important tool of data analysis, top-k computing is widely used in web search, electronic commerce, data mining, multi-standard decision support and so on. With the advent of big data, the traditional top-k processing technology has encountered unprecedented challenges, and has been unable to meet the needs of big data analysis. In the new environment, top-k computing faces three main challenges: first, the data scale reaches the level of TB or PB, the traditional single-machine processing method is no longer applicable, the distributed parallel computing framework should be considered; Second, what kind of data partition method can be adopted to improve the parallel performance and query speed in the distributed environment in the face of massive data sets; Third, the traditional top-k query requires the user to give a rating function, but it is not easy to select a proper score function. Therefore, the key technologies of data partitioning and parallel algorithm design for top-k computing in distributed environment are studied and explored in this paper. The main research contents are as follows: (1) in the framework of distributed computing, In order to solve the problem of weighted top-k query, a similar grid data partition method is proposed. The original data set is divided into different subdatasets, and subdatasets are selected to replace all the data sets according to user preferences to reduce the query data. Aiming at the phenomenon of "empty space" in high dimension, this paper introduces hyperplane partition on the basis of grid division. Compared with the data partitioning method based on angle and hyperplane, this method is simple to preprocess without complex coordinate transformation, and it is still applicable to the phenomenon of "empty space" in higher dimensions. The experimental results show that the query speed of similar grid and hyperplane data partitioning method in big data environment is nearly 15 鈥,
本文编号:2319947
[Abstract]:As a kind of preference query, Top-k computing is the most basic operation in the database, which aims to find out the information that the user may be interested in from the given data set. As an important tool of data analysis, top-k computing is widely used in web search, electronic commerce, data mining, multi-standard decision support and so on. With the advent of big data, the traditional top-k processing technology has encountered unprecedented challenges, and has been unable to meet the needs of big data analysis. In the new environment, top-k computing faces three main challenges: first, the data scale reaches the level of TB or PB, the traditional single-machine processing method is no longer applicable, the distributed parallel computing framework should be considered; Second, what kind of data partition method can be adopted to improve the parallel performance and query speed in the distributed environment in the face of massive data sets; Third, the traditional top-k query requires the user to give a rating function, but it is not easy to select a proper score function. Therefore, the key technologies of data partitioning and parallel algorithm design for top-k computing in distributed environment are studied and explored in this paper. The main research contents are as follows: (1) in the framework of distributed computing, In order to solve the problem of weighted top-k query, a similar grid data partition method is proposed. The original data set is divided into different subdatasets, and subdatasets are selected to replace all the data sets according to user preferences to reduce the query data. Aiming at the phenomenon of "empty space" in high dimension, this paper introduces hyperplane partition on the basis of grid division. Compared with the data partitioning method based on angle and hyperplane, this method is simple to preprocess without complex coordinate transformation, and it is still applicable to the phenomenon of "empty space" in higher dimensions. The experimental results show that the query speed of similar grid and hyperplane data partitioning method in big data environment is nearly 15 鈥,
本文编号:2319947
本文链接:https://www.wllwen.com/jingjilunwen/dianzishangwulunwen/2319947.html