Research on a Highly Scalable RDF Data Storage System

Published: 2018-04-03 05:26

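Topic: Resource Description Framework　Focus: semantic data representation　Source: Huazhong University of Science and Technology, master's thesis, 2012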

[Abstract]: Because RDF (Resource Description Framework) data is flexible to express and easy to exchange, its volume is growing at a remarkable rate. Traditional RDF storage systems either use a relational database as the storage backend or store data in a native format, but both approaches run into scalability problems on large-scale RDF data. Storing RDF data at that scale requires reducing the storage footprint and speeding up query processing. The storage schemes proposed so far are not compact enough and hold a large amount of redundant data, so a great deal of time is spent generating and executing query plans. TripleBit, a highly scalable RDF data storage system, aims to provide efficient storage and querying for large-scale RDF data. Exploiting the characteristics of RDF data, the system represents it as a bitmap matrix. To reduce the space the data occupies, a compression algorithm is designed for each data table according to that table's characteristics and role. At the physical layer, the system uses memory-based storage to lower the I/O overhead of storing and querying, and it stores data in chunks, which simplifies storage management, keeps the storage structure compact, and speeds up query processing. To make RDF lookups faster, the system provides two kinds of indexes: one accelerates locating data chunks, the other accelerates queries whose predicate is unknown. At query time, the system generates a query plan simply and effectively from heuristic rules. During execution it chooses a strategy according to the query type and uses a parallel execution subsystem to make join operations more efficient. For query plans with multiple variables, it uses a two-step execution strategy that reduces the intermediate results produced during the query and adjusts the query plan dynamically. A performance comparison with RDF-3X, a popular RDF storage system, shows that TripleBit uses at least 40% less storage space than RDF-3X and delivers at least 3 times the query performance. The experiments further show that TripleBit's query plan generation method and indexing techniques contribute substantially to its query processing performance.
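The abstract does not spell out the layout of the bitmap matrix, so the following is only a minimal sketch under stated assumptions: each triple is taken to be one column, each subject or object is a row, and columns are grouped by predicate. The class `BitMatrixStore` and its methods are hypothetical names chosen for illustration, and the system's compression, chunking, and indexes are not modeled.

```python
from collections import defaultdict


class BitMatrixStore:
    """Toy triple store: one conceptual matrix column per triple (an assumption)."""

    def __init__(self):
        self.ids = {}                      # entity string -> integer row id
        self.columns = defaultdict(list)   # predicate -> [(subject_row, object_row)]

    def _row(self, term):
        # Row ids are handed out in first-seen order.
        return self.ids.setdefault(term, len(self.ids))

    def add(self, s, p, o):
        # A column has 1-bits only at the subject and object rows,
        # so storing the two row ids is enough to reconstruct it.
        self.columns[p].append((self._row(s), self._row(o)))

    def column_bits(self, p, i):
        # Materialize column i of predicate p as a bit string (illustration only).
        s, o = self.columns[p][i]
        return "".join("1" if r in (s, o) else "0" for r in range(len(self.ids)))

    def objects(self, s, p):
        # Answer the pattern (s, p, ?o) by scanning only the columns of p.
        names = {row: term for term, row in self.ids.items()}
        row = self.ids.get(s)
        return [names[o] for (sr, o) in self.columns.get(p, []) if sr == row]


store = BitMatrixStore()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
store.add("bob", "worksAt", "HUST")

print(store.column_bits("knows", 0))    # 1100
print(store.column_bits("knows", 1))    # 1010
print(store.objects("alice", "knows"))  # ['bob', 'carol']
```

Because every column is this sparse, keeping two small integers per triple instead of a full bit column is already a large saving, which is the kind of compactness the abstract attributes to the design.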
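[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP333; TP391.3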

[References]