面向图数据的Top-k检索算法研究

发布时间：2018-08-20 15:44

【摘要】：近年来随着社交网络、知识图谱等应用的飞速发展,图数据大量出现在学术界和工业界,如何有效管理图数据并从中挖掘有价值的信息已经成为当前数据管理领域的研究热点。其中面向图数据的Top-k检索问题广泛存在于在各类应用中,旨在从图中找出与查询最相关的k个匹配结果,是图数据管理的一个重要研究问题。针对当前图数据的数据规模大、节点和边的种类多、内容动态变化等特点,本研究重点关注面向图数据的Top-k检索中三个关键问题:分布式环境的Top-k子图检索、多数据源的Top-k子图检索和动态数据的持续相关Top-k检索,归纳总结已有工作的优点和不足,提出相应改进算法。本研究的创新点主要有:·提出基于星形查询拆分的分布式Top-k子图检索算法。算法基于星形子图拆分策略,包含对原始查询进行子查询拆分、检索子查询匹配结果、合并子查询匹配结果三个主要步骤。本研究基于算法停止条件的相关特点对算法子查询执行和合并子查询匹配结果过程进行了三方面的优化,进一步降低了算法的跨节点数据传输开销。实验结果表明:算法具有很好的可扩展性,优化后算法执行时间比已有方法平均低30%-40%。·提出面向多数据源Top-k子图检索的排序连接算法。算法包含构造来自不同数据源的匹配结果列表、连接结果列表并计算所得完整结果的分数和判断算法停止条件是否满足三个主要步骤。针对已有方法计算尚未访问结果得分上界松弛的问题,本研究提出优化的匹配结果列表排序函数和基于最小下降分数的结果列表读取策略,并从理论上证明了排序函数最优参数配置应该满足的条件,以及优化读取策略产生的结果访问顺序最优。实验结果表明:本研究提出的排序连接算法比已有工作开销低50%到70%。·提出基于动态区间窗口划分的持续相关Top-k检索算法。算法基于从时间维度对问题空间进行单位时间区间划分的思想,在每个区间执行Top-k检索,再归并计算持续相关Top-k结果。针对单位区间划分可扩展性差的问题,本研究进一步提出基于动态区间窗口划分的算法,检索过程中动态维护时间窗口,避免不必要区间划分,减少了子问题的数量,同时合并不同区间重复的计算子过程,提升了算法的空间和时间效率。实验结果表明:基于动态区间窗口划分算法在消耗更少内存的情况下,执行时间比已有方法低50%以上。
[Abstract]:In recent years, with the rapid development of social network, knowledge map and other applications, graph data appear in academia and industry. How to effectively manage graph data and mine valuable information has become a research hotspot in the field of data management. The problem of graph data oriented Top-k retrieval is widely used in all kinds of applications. It is an important research problem in graph data management to find k matching results that are most relevant to query from graph. In view of the characteristics of large scale of graph data, variety of nodes and edges and dynamic change of content, this paper focuses on three key problems in Top-k retrieval for graph data: Top-k subgraph retrieval in distributed environment. The Top-k subgraph retrieval of multiple data sources and the continuous correlated Top-k retrieval of dynamic data are summarized, and the advantages and disadvantages of existing work are summarized, and the corresponding improved algorithm is proposed. The main innovations of this study are as follows: a distributed Top-k subgraph retrieval algorithm based on star query splitting is proposed. The algorithm is based on star subgraph splitting strategy, which consists of three main steps: subquery splitting, subquery matching, merging subquery matching. Based on the characteristics of the algorithm stopping condition, this paper optimizes the performance of the algorithm subquery and the matching process of the merged sub-query, and further reduces the cross-node data transmission overhead of the algorithm. The experimental results show that the algorithm has good scalability and the execution time of the optimized algorithm is 30 to 40 less than that of the existing methods. A sorting join algorithm for multi-data source Top-k subgraph retrieval is proposed. The algorithm consists of constructing a list of matching results from different data sources, connecting the result list, calculating the score of the resulting complete result and judging whether the stopping condition of the algorithm meets the three main steps. To solve the problem of calculating the upper bound relaxation of unvisited results by existing methods, this paper proposes an optimized matching result list sorting function and a result list reading strategy based on the minimum descent score. It is proved theoretically that the optimal parameter configuration of the sort function should satisfy the conditions and the optimal access order of the results generated by the optimized reading strategy. The experimental results show that the proposed sorting connection algorithm is 50% to 70% lower than the existing overhead, and a continuous correlation Top-k retrieval algorithm based on dynamic interval window partition is proposed. Based on the idea of dividing the problem space into unit time intervals from the time dimension, the algorithm performs Top-k retrieval in each interval, and then merges and calculates the sustained correlation Top-k results. In order to solve the problem of poor scalability of unit interval partitioning, this paper proposes an algorithm based on dynamic interval window partitioning, which maintains the time window dynamically in the retrieval process, avoids unnecessary interval partitioning, and reduces the number of sub-problems. At the same time, the computation subprocesses with different interval repetition are combined to improve the space and time efficiency of the algorithm. The experimental results show that the execution time of the algorithm based on dynamic interval window partition is more than 50% lower than that of the existing methods under the condition of less memory consumption.
【学位授予单位】：清华大学
【学位级别】：博士
【学位授予年份】：2016
【分类号】：TP391.3

【相似文献】