Research on Large-Scale Semantic Data Storage and Query Techniques

Published: 2019-05-08 07:34

[Abstract]: The Semantic Web is now widely used in fields such as medicine, biology, and geographic information services. With the arrival of the big data era and the steady growth of application systems, however, the volume of semantic data is growing at a remarkable rate. Traditional semantic data storage and management systems built on relational databases can no longer store and manage such rapidly growing data effectively, and traditional serial semantic query techniques are poorly suited to query processing at this scale. Against this background, using parallel computing to address large-scale semantic data storage and querying has become a research focus in both academia and industry. Parallel computing techniques are tightly coupled to the application problem, though, and application problems differ in complexity and are highly diverse, so processing large-scale semantic data poses significant technical challenges and calls for in-depth study of both storage and querying. To address these problems, this thesis analyzes the Resource Description Framework (RDF) and the RDF query language SPARQL (SPARQL Protocol and RDF Query Language) together with related technologies and, building on the industry-standard OpenRDF Sesame semantic data processing framework, proposes a large-scale distributed semantic data storage and query method based on HBase and Redis. The method uses hybrid indexes to build a layered storage architecture that improves semantic data query performance. On this basis, the thesis further analyzes the processing pipeline of the SPARQL query engine, builds a cost model to optimize join queries in the query model, and uses intermediate query result sets to optimize the query execution strategy, so that semantic data queries run efficiently. To improve the reliability and availability of the query engine, the thesis also examines fault-tolerance and scalability design for large-scale semantic data storage management and for the query engine. Finally, based on the storage architecture and query optimization schemes studied, the thesis designs and implements a prototype system for large-scale semantic data storage and querying. Experimental results show that the proposed storage and query method is effective. The work falls into two parts. Part one studies existing semantic data storage techniques, designs a storage model for large-scale semantic data, proposes a hybrid index storage method and a layered storage architecture based on that model, and proposes fault-tolerance and scalability solutions for the storage architecture. Part two analyzes the execution flow of the semantic data query engine: for query model optimization, it proposes a join optimization algorithm based on selectivity estimation; for query strategy optimization, it proposes an adaptive batch query scheme.
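As a rough illustration of what join ordering by selectivity estimation can look like for SPARQL triple patterns, the sketch below orders patterns so that the most selective one runs first and each later pattern shares a variable with those already chosen. It is not the thesis's algorithm, which the abstract does not spell out. The function names (`build_stats`, `estimate_selectivity`, `order_patterns`), the per-position frequency statistics, and the independence assumption between positions are all hypothetical choices made for this example.

```python
from collections import Counter

# A triple pattern is a tuple (s, p, o); terms starting with "?" are variables.
# All names and the statistics layout below are illustrative assumptions.

def is_var(term):
    return term.startswith("?")

def vars_of(pattern):
    """Return the set of variables appearing in a triple pattern."""
    return {t for t in pattern if is_var(t)}

def build_stats(triples):
    """Count how often each term occurs in each position (s, p, o)."""
    stats = {"s": Counter(), "p": Counter(), "o": Counter(), "total": len(triples)}
    for s, p, o in triples:
        stats["s"][s] += 1
        stats["p"][p] += 1
        stats["o"][o] += 1
    return stats

def estimate_selectivity(pattern, stats):
    """Estimate the fraction of triples matching a pattern.

    Assumes the three positions are statistically independent, so the
    estimate is the product of the per-position frequencies of the constants.
    """
    total = stats["total"]
    if total == 0:
        return 0.0
    sel = 1.0
    for pos, term in zip("spo", pattern):
        if not is_var(term):
            sel *= stats[pos][term] / total
    return sel

def order_patterns(patterns, stats):
    """Greedy join ordering: lowest estimated selectivity first, preferring
    patterns that share a variable with those already chosen so that
    Cartesian products are avoided where possible."""
    remaining = list(patterns)
    ordered, bound = [], set()
    while remaining:
        connected = [p for p in remaining if bound & vars_of(p)]
        candidates = connected or remaining
        best = min(candidates, key=lambda p: estimate_selectivity(p, stats))
        ordered.append(best)
        bound |= vars_of(best)
        remaining.remove(best)
    return ordered

# Small usage example on made-up data.
triples = [
    ("alice", "type", "Person"),
    ("bob", "type", "Person"),
    ("carol", "type", "Person"),
    ("alice", "worksAt", "NJU"),
    ("bob", "knows", "alice"),
    ("carol", "knows", "alice"),
]
patterns = [
    ("?x", "type", "Person"),
    ("?x", "worksAt", "NJU"),
    ("?y", "knows", "?x"),
]
plan = order_patterns(patterns, build_stats(triples))
print(plan)
```

A greedy ordering like this is cheap to compute but can miss the best plan; a cost model that also accounts for intermediate result sizes, as the abstract describes, would refine it.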
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP311.13;TP333
