数据空间集成与查询关键技术研究

发布时间：2018-04-15 03:36

本文选题：数据空间 + 数据模型　；参考：《哈尔滨工程大学》2016年博士论文

【摘要】：在过去的十年,互联网、云计算、大数据以及移动互联等技术得到蓬勃发展,这使得当前数据呈现出体量巨大、种类繁多、动态演化和松散关联等新特点。传统的数据库管理技术无法管理这样的数据,因此,研究新的数据管理技术来驾驭这些数据就显得尤为必要。数据空间技术应运而生,并引起数据库社区和工业界广泛关注。然而,数据空间在数据集成与数据查询方面仍然存在许多尚未(或未完全)解决的问题。例如,缺少表示异构数据以及复杂语义关系的数据模型;缺少面向动态演化环境下的数据空间实体划分技术;缺少支持具有高倾斜分布、大规模异构数据的多维索引技术;缺少无缝搜索异构数据、表达力较强的近似查询技术等。本文立足于数据空间集成与数据查询方面的研究,旨在能够统一地管理各种结构化、半结构化与非结构化数据,并且能够高效地、无缝地搜索这些异构数据,从而为“Pay-as-you-go”方式集成数据提供基本保障,进而提供“Best-effort”的数据空间查询服务。针对上述问题,本文将从以下方面展开深入细致的研究。首先,针对数据空间中异构数据具有上下文依赖性以及语义关系具有复杂性特点,对数据空间表示模型进行了研究。通过一个案例分析了传统数据空间模型(如解释对象模型)的缺陷,提出了一种基于上下文感知的复杂语义关联网络模型(COSAN)。具体而言,(1)在传统解释对象模型基础上,考虑异构数据的上下文依赖性,形式化地定义了上下文感知的异构数据表示方法。该方法把上下文信息与数据源的结构化、半结构化以及非结构化信息统一封装为上下文感知的解释对象,从而表达上下文感知的异构信息;(2)为克服传统数据模型只能表示简单二元语义关系的缺陷,通过一组约束组件(如上下文约束、顺序约束和聚合约束等)扩展了传统的二元语义关系,形式化地表示了复杂语义关系;(3)在公开数据集DBLP上进行了大量实验,实验结果验证了该模型的有效性和可行性。其次,针对数据空间实体具有信息丰富性、类别滞后性以及动态演化性特点,对面向数据空间的实体划分技术进行了研究,提出了一种基于演化K-Means的数据空间实体划分方法。具体而言,(1)提出了一种基于轮廓值和KL-散度的演化K-Means聚类框架。该框架不仅考虑当前聚簇的质量(即,快照代价),还考虑了若干典型的历史聚簇结构的时间平滑性(即,历史代价);(2)通过综合使用实体自身的丰富信息和实体间的历史出现模式信息,设计了一种面向数据空间实体的相似性度量方法,从而较准确地度量实体间的相似性;(3)根据启发式规则,提出了一种基于相似性密度的演化K-Means聚类算法,较好地解决了初始点选择问题和在演化环境中数据空间实体划分问题;(4)扩展了演化K-Means聚类框架,以处理簇数量随时间发生变化、快照实体随时间加入或移除的情况;(5)在公开数据集DBLP上进行了大量实验,实验结果表明本方法优于传统已有的方法,它不仅能高质量地捕获当前实体聚类结果,还能健壮地反映历史聚簇情况。再次,针对传统数据空间索引方法无法适用于高倾斜分布的大规模数据的问题,从负载均衡和划分角度对数据空间多维索引技术进行了研究,提出了一种基于负载均衡和查询日志的数据空间多维索引方法,旨在保持各个索引节点负载均衡、减少查询通信开销、提高数据空间查询处理性能。具体而言,(1)在垂直划分中,聚合在查询日志和实体中频繁出现的token词,以减少查询涉及倒排列表的聚合/合并开销。在此基础上,结合超图理论和用户查询与倒排列表间访问模式信息,把垂直划分问题进一步归约为超图划分问题,从而保持垂直划分的负载均衡;(2)在水平划分中,结合超图理论和用户查询与实体间访问模式信息,把水平划分问题归约为超图划分问题,从而保持水平划分的负载均衡;(3)结合垂直划分和水平划分策略,构建了二维混合索引。在此基础上,从查询吞吐量与容错率角度考虑,利用索引副本策略,进一步扩展为三维索引;(4)在公开数据集DBLP上进行了大量实验,实验结果表明本方法在吞吐量、查询响应时间及扩展性等方面优于已有方法。最后,针对传统数据空间查询语义、查询结构较简单的缺陷,对面向数据空间的top-k近似子图查询技术进行了研究,提出了一种基于邻域结构的top-k近似子图查询方法。具体而言,(1)形式化地定义了数据空间中top-k近似子图查询问题,在图管理理论基础上,提出了一种新型的数据空间查询语言GQL;(2)通过综合利用顶点距离邻近性信息和边标签分布性信息,设计了一种基于邻域结构的图相似性函数;(3)基于索引技术和邻域结构特征,提出了一种基于邻域结构的匹配顶点剪枝算法,从而剪枝掉大量无希望的候选匹配顶点;(4)通过考虑顶点剪枝策略和顶点匹配顺序,提出了一种面向数据空间的top-k近似子图搜索算法;(5)在真实数据集DBLP上进行了大量实验,实验结果表明该方法在查询效果、查询效率和扩展性方面明显优于已有方法。
[Abstract]:In the past ten years, the Internet, cloud computing, big data and mobile Internet technology is developing rapidly, which makes the data showing a huge volume, variety, new characteristics of dynamic evolution and loosely related. The traditional database management technology to manage such data, therefore, research on new data management technology to manage these data is particularly necessary. Data space came into being, and cause the database community and industry wide attention. However, many still do not exist in the data space, data integration and data query (or not) to solve the problem. For example, the lack of heterogeneous data and complex semantic relation data model; lack of data space entity techniques for dynamic evolution of environment; lack of support with high skew distribution, multidimensional indexing technology of heterogeneous data; lack of seamless search Cable heterogeneous data, expressive approximate query technology. This paper is based on the spatial data integration and data query research, to unified management of various structured, semi-structured and unstructured data, and can efficiently and seamlessly search these heterogeneous data, so as to provide the basic guarantee for the integration of "Pay-as-you-go" the data, which provides "Best-effort" data spatial query service. Aiming at the above problems, this paper will research deeply from the following aspects. Firstly, according to the data space of heterogeneous data with context dependent complexity characteristics and semantic relations, said model to study data space. Data were analyzed by the traditional space model a case (such as object model) defects, presents a complex semantic correlation network model based on context awareness (CO SAN). Specifically, (1) based on the traditional interpretation of the object model, considering the heterogeneous data context dependent, the formal definition of the context aware heterogeneous data representation method. This method takes the structure of context information and data sources, semi-structured and unstructured information unified package for context aware object thus, the expression of heterogeneous information context; (2) to overcome the traditional data model can express the defect simple $two semantic relations, through a set of constraint components (such as context constraints, sequence constraints and polymerization constraints etc.) extends the semantic relations of the traditional two yuan, to formally represent the complex semantic relations; (3) in the public data set DBLP on a large number of experiments, the experimental results verify the feasibility and validity of the model. Secondly, with rich information in data space lag and dynamic entity category Evolution characteristics of oriented data space entity partitioning technology research, put forward a method of data space entity partitioning based on evolutionary K-Means. Specifically, (1) proposed a framework for clustering and evolution of K-Means profile based on KL- divergence. The framework not only consider the quality of the clusters (that is, the price), snapshot also takes into account the time smoothness of some typical historical cluster structure (i.e., historical cost); (2) through the comprehensive use of solid rich information and the history between the pattern information, design a similarity measure method for spatial entity data, thus more accurate to measure the similarity between entities; (3) according to the heuristic rules, presents a similar evolution of K-Means clustering algorithm based on density, better solves the initial selection problem and divides the data space entities in the evolution of environment The problem; (4) expansion of the evolution of the K-Means clustering framework to deal with the number of clusters change with time, the snapshot entity with time to add or remove the situation; (5) in the public data set DBLP on a large number of experiments, the experimental results show that this method is superior to the traditional method of existing, it can not only capture the current high quality the entity clustering results, but also robust to reflect the history of cluster. Thirdly, the traditional data spatial index method is not suitable for large-scale data in high inclined distribution problems, from the angle of load balancing and division of data space Treviso cited Technology research, this paper proposes a data space multidimensional index method of load balancing and query log based on each index node to keep the load balance, reduce the query communication costs, improving the spatial data query processing performance. Specifically, (1) in the vertical partition, aggregation in the query The frequent log and entities in token, in order to reduce the overhead associated with queries involving aggregation / inverted list. On this basis, combining hypergraph theory and user query and inverted list access pattern information, the vertical partition problem is further reduced to a hypergraph partitioning problem, and keep the load balance from the vertical division; (2 at the level of division,) combining hypergraph theory and user query and entity access pattern information, the hypergraph partitioning problem level partition problem reduction, so as to maintain load balancing level; (3) combined with the vertical and horizontal partition partition strategy, construct the two-dimensional hybrid index. On this basis, from the query throughput and fault tolerance point of view, using the index replication strategy, further extended to 3D index; (4) in the public data set DBLP on a large number of experiments, the experimental results show that this method in throughput and query response time And scalability is superior to existing methods. Finally, in view of the traditional spatial data query semantics, query the defect structure is relatively simple, Top-k oriented data space approximate subgraph query technology, proposes a neighborhood structure based on the Top-k approximate subgraph query method. Specifically, (1) formal the definition of the data space Top-k approximate subgraph query problem in graph management based on the theory, proposed a new spatial data query language GQL; (2) by using vertex distance proximity information and edge label distribution information, a similarity function is designed based on the neighborhood structure diagram (; 3) and neighborhood index technology based on the structure characteristic and propose a matching vertex pruning algorithm based on neighborhood structures, thus pruning out a lot of hopeless candidate matching points; (4) by considering the vertex and vertex matching pruning strategy In order to match the order, we propose a data oriented Top-k approximate subgraph search algorithm. (5) a lot of experiments have been done on the real data set DBLP. The experimental results show that this method is superior to the existing methods in query efficiency, query efficiency and scalability.

【学位授予单位】：哈尔滨工程大学
【学位级别】：博士
【学位授予年份】：2016
【分类号】：TP311.13

【相似文献】