Research and Implementation of a Big Data Analysis and Processing System Based on Fuzzy Query

Published: 2018-05-07 06:54

Topic of this thesis: online aggregation + samples; Reference: Zhejiang University, 2017 master's thesis

[Abstract]: As big data analysis technology matures, the value contained in big data is receiving more and more attention. Because of the sheer volume of data, analyzing big data is generally time-consuming. In many cases, however, users do not need exact query results; a rough profile of the data satisfies most analysis needs. This thesis studies and implements a big data analysis and processing system based on fuzzy (approximate) queries. The system defines a set of query interfaces that let users run a variety of aggregation queries (Group By), and it returns a fuzzy result for each query. It can return fuzzy query results over hundreds of gigabytes of data within seconds. Because online aggregation can quickly produce a profile of the data, this thesis applies it in the system. In addition, the result sets of adjacent queries in the system overlap, so saving the samples collected and the intermediate results computed for queries already processed can speed up subsequent queries. On this basis, the thesis optimizes online aggregation. First, the data set is randomized to produce a random data set, so that scanning the random data set sequentially has the same effect as sampling the original data set at random. Then, user query requests are processed with online aggregation. While generating query results, online aggregation stores the samples already obtained and the intermediate results produced in a sample management tree. Correspondingly, each user query is first processed against this tree; only when the result found in the tree cannot meet the user's needs does the system read data from the data source. In this way, the samples and intermediate results from online aggregation can be used effectively by multiple queries. The thesis also provides a method for combining multiple intermediate results to produce the final query result. Finally, experimental results on the TPC-H benchmark verify the effectiveness of the system designed and implemented in this thesis.
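The pipeline described in the abstract (randomize once, scan sequentially, keep intermediate results for later queries) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `Partial`, `randomize`, and `online_avg_by_group` are hypothetical, a flat dictionary stands in for the sample management tree (whose structure the abstract does not describe), and the error bound is a plain normal-approximation interval chosen for illustration.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Partial:
    """Intermediate result for one group: count, sum, and sum of squares."""
    n: int = 0
    total: float = 0.0
    sq_total: float = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.sq_total += x * x

    def merge(self, other: "Partial") -> "Partial":
        # Partials built from disjoint samples combine by simple addition.
        return Partial(self.n + other.n, self.total + other.total,
                       self.sq_total + other.sq_total)

    def mean_with_interval(self, z: float = 1.96) -> Tuple[float, float]:
        """Running mean and an approximate 95% half-width (assumed bound)."""
        mean = self.total / self.n
        if self.n < 2:
            return mean, float("inf")
        var = max(0.0, (self.sq_total - self.n * mean * mean) / (self.n - 1))
        return mean, z * math.sqrt(var / self.n)


def randomize(rows: List[Tuple[str, float]], seed: int = 0) -> List[Tuple[str, float]]:
    """Shuffle once up front so a sequential scan behaves like random sampling."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def online_avg_by_group(shuffled_rows, state: dict, target_half_width: float,
                        batch_size: int = 1000) -> Dict[str, Tuple[float, float]]:
    """Approximate AVG(value) GROUP BY group over pre-shuffled (group, value) rows.

    `state` is a stand-in for the sample management tree: it remembers the
    scan offset and per-group partials, so a later query resumes from them.
    """
    partials = state.setdefault("partials", {})
    offset = state.setdefault("offset", 0)
    while offset < len(shuffled_rows):
        # Answer from saved intermediate results first; read more data only if needed.
        if partials and all(p.mean_with_interval()[1] <= target_half_width
                            for p in partials.values()):
            break
        for group, value in shuffled_rows[offset:offset + batch_size]:
            partials.setdefault(group, Partial()).add(value)
        offset = min(offset + batch_size, len(shuffled_rows))
    state["offset"] = offset
    return {g: p.mean_with_interval() for g, p in partials.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    data = [("a", rng.gauss(10, 5)) for _ in range(50000)]
    data += [("b", rng.gauss(20, 5)) for _ in range(50000)]
    state: dict = {}
    print(online_avg_by_group(randomize(data), state, target_half_width=0.5))
    print("rows scanned:", state["offset"])
```

A second call with the same `state` and a tighter `target_half_width` starts from the saved partials and scans only the additional rows it needs, which mirrors the reuse of samples across queries that the abstract describes. `Partial.merge` shows one simple way to combine intermediate results computed over disjoint samples.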
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

[References]