Research on Big Data Graph Query Based on Parallel Processing

Published: 2018-08-02 12:39

[Abstract]: With the rapid development of the Internet, we have entered an era in which data is king. Data volumes have grown enormous and the data itself has become increasingly complex, so finding useful information within this large, heterogeneous data has become an urgent problem. At the same time, distributed cloud storage has become a common storage solution, which turns the problem into one of querying data held in distributed storage. A widely used and powerful tool for on-demand queries over large-scale distributed data is the graph: graph data structures are well suited to data with reference relationships, so big data queries can be recast as graph query problems. One major class of graph query asks, for two given nodes in a data graph, whether one can be reached from the other; this is the graph reachability query problem. It has a wide range of practical applications and is of considerable research significance. Traditional approaches to reachability queries are either restricted to tree-based graph queries or tied to specific graph database systems. Most of them rely on indexes, and they show serious shortcomings in accuracy and performance when applied to distributed big data graphs. To address these problems, this thesis proposes a parallel reachability query algorithm built on the MapReduce programming model of the Hadoop distributed computing platform, together with an index based on six-degree reachability queries for answering reachability queries locally. These algorithms aim to optimize reachability queries on large distributed graphs. Multiple experiments on several real-world datasets, evaluated against a range of metrics and from several perspectives, verify the accuracy and efficiency of the algorithms.
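The abstract names the technique but does not spell out the algorithm, so the following is only a minimal sketch of how a reachability query can be phrased as repeated map and reduce rounds over a frontier of nodes. It runs in a single process rather than on Hadoop, and every name in it (`map_phase`, `reduce_phase`, `reachable`, `max_hops`, the sample `graph`) is a hypothetical choice for illustration, not the thesis's implementation. The optional `max_hops` bound is included only to illustrate a hop-limited query such as a six-degree check; it is not the six-degree index the thesis proposes.

```python
from collections import defaultdict


def map_phase(frontier, adjacency):
    """Map step: for every edge leaving a frontier node, emit (neighbor, parent)."""
    for node in frontier:
        for neighbor in adjacency.get(node, ()):
            yield neighbor, node


def reduce_phase(pairs, visited):
    """Reduce step: group emitted pairs by neighbor and keep unseen nodes only."""
    grouped = defaultdict(list)
    for neighbor, parent in pairs:
        grouped[neighbor].append(parent)
    return {node for node in grouped if node not in visited}


def reachable(adjacency, source, target, max_hops=None):
    """Return True if target can be reached from source.

    Each loop iteration stands in for one MapReduce round that expands the
    frontier by one hop. If max_hops is given, the search stops after that
    many rounds (an assumed way to express a hop-limited query).
    """
    if source == target:
        return True
    visited = {source}
    frontier = {source}
    hops = 0
    while frontier and (max_hops is None or hops < max_hops):
        frontier = reduce_phase(map_phase(frontier, adjacency), visited)
        if target in frontier:
            return True
        visited |= frontier
        hops += 1
    return False


# Hypothetical sample graph: a -> b -> c -> d, plus e -> a.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "e": ["a"]}

print(reachable(graph, "a", "d"))              # unbounded query
print(reachable(graph, "d", "a"))              # edges are directed
print(reachable(graph, "a", "d", max_hops=2))  # hop-limited query
```

In a real deployment the adjacency lists would be partitioned across machines and each round would be a separate job, so the number of rounds, which grows with the path length, is the main cost that an index is meant to reduce.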
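[Degree-granting institution]: North China Electric Power University (Beijing)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; O157.5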
