Design and Implementation of Massive Data Storage and Query Based on a Knowledge Base and Cloud Platform

Published: 2018-07-03 12:51

Topic: RDF + storage; Source: Beijing University of Posts and Telecommunications, 2017 master's thesis

[Abstract]: With the rapid development of the Internet, data volumes are growing quickly, and much of this data is heterogeneous, originating from different sources. The successful application of knowledge graphs in information search has spurred research on the fusion, storage, and querying of heterogeneous data. An ontology labels resources on the Internet with unique identifiers and lets one attach both the resources' own attributes and the relationships between resources, which makes it flexible and extensible. With the rise of the Semantic Web and after decades of development, ontologies are widely used to represent heterogeneous data and are recognized as an effective solution. In recent years, much research in computer science has addressed ontology-based data management and applications. Traditional storage methods keep information of different categories in separate tables, which yields homogeneous search results that cannot meet user needs. As network scale and the volume of multi-source data grow, traditional database storage schemes and single-machine environments struggle to support the storage and querying of massive data. Consequently, more and more cloud-platform and distributed-system solutions are being applied to data storage and querying. Although research based on distributed systems is not yet mature, it has considerable research value and promise. Building on the cloud platform Hadoop and the non-relational database HBase, this thesis studies the fusion, storage, and querying of massive heterogeneous data. The main work is as follows:
1. First, as the foundation for the subsequent distributed storage and querying, multi-source heterogeneous data is fused. The thesis implements parallel ontology construction and fusion with the parallel computing framework MapReduce. During construction, the data from each source is built into an ontology of a single category. During fusion, the data from the different sources is merged to produce an ontology rich in categories and semantics.
2. As data grows explosively, the bottlenecks of traditional storage methods in import performance and in the demands they place on single-machine storage hardware have become increasingly apparent. Drawing on recent distributed RDF storage schemes, and weighing both storage space and the response speed of subsequent queries, the thesis designs an HBase-based storage model.
3. On top of the HBase storage model, query strategies are designed for triple pattern queries, basic graph pattern queries, and keyword queries. Triple pattern queries underlie basic graph pattern queries, and their response speed depends on two factors: the database table design and the index performance of the database itself. In addition, by analyzing the structural regularities of complex basic graph pattern queries, the thesis proposes an optimization method based on join operations. Keyword queries matter because they make the query engine easier to use; the keyword search method proposed here builds on the basic graph pattern query results and achieves good performance. Experiments on the LUBM dataset verify the effectiveness and efficiency of the strategies.
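The abstract does not spell out the table design or the join optimization, so the following is only an illustrative sketch of the general idea, not the thesis's actual method. It assumes three hypothetical index tables keyed by permutations of (subject, predicate, object), kept as sorted Python lists to stand in for HBase's lexicographically sorted row keys, so that a triple pattern becomes a prefix scan. A basic graph pattern is then answered by joining triple pattern results, using a simple greedy ordering (most bound terms first) that is an assumption made for this example. All class, function, and data names are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical layout: three sorted "tables" whose row keys are permutations
# of (subject, predicate, object). With these three orders, every combination
# of bound positions forms a key prefix in at least one table.
ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


class TripleStore:
    def __init__(self, triples):
        # Sorted lists stand in for lexicographically sorted row keys.
        self.tables = {
            name: sorted(tuple(t[i] for i in order) for t in set(triples))
            for name, order in ORDERS.items()
        }

    def match(self, s=None, p=None, o=None):
        """Answer one triple pattern; None marks an unbound position."""
        pattern = (s, p, o)
        best_name, best_prefix = None, None
        for name, order in ORDERS.items():
            prefix = []
            for i in order:
                if pattern[i] is None:
                    break
                prefix.append(pattern[i])
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_name, best_prefix = name, tuple(prefix)
        table, order = self.tables[best_name], ORDERS[best_name]
        results = []
        # Prefix scan: start at the first key >= prefix, stop when it no longer matches.
        for row in table[bisect_left(table, best_prefix):]:
            if row[:len(best_prefix)] != best_prefix:
                break
            triple = [None] * 3
            for pos, value in zip(order, row):
                triple[pos] = value
            results.append(tuple(triple))
        return results


def is_var(term):
    return term.startswith("?")


def query_bgp(store, patterns):
    """Evaluate a basic graph pattern by joining triple pattern results."""
    bindings, remaining, bound = [{}], list(patterns), set()
    while remaining and bindings:
        # Greedy heuristic (an assumption): take the pattern with the most
        # constants or already-bound variables next.
        nxt = max(remaining,
                  key=lambda pat: sum(not is_var(t) or t in bound for t in pat))
        remaining.remove(nxt)
        new_bindings = []
        for b in bindings:
            # Substitute known variable values, then look up the pattern.
            args = [b.get(t) if is_var(t) else t for t in nxt]
            for triple in store.match(*args):
                cand = dict(b)
                if all(cand.setdefault(term, value) == value
                       for term, value in zip(nxt, triple) if is_var(term)):
                    new_bindings.append(cand)
        bindings = new_bindings
        bound.update(t for t in nxt if is_var(t))
    return bindings


# Made-up, LUBM-flavored sample data for illustration only.
store = TripleStore([
    ("alice", "type", "GraduateStudent"),
    ("bob", "type", "GraduateStudent"),
    ("alice", "takesCourse", "course1"),
    ("bob", "takesCourse", "course2"),
    ("prof1", "teacherOf", "course1"),
])
result = query_bgp(store, [
    ("?x", "type", "GraduateStudent"),
    ("?x", "takesCourse", "?c"),
    ("prof1", "teacherOf", "?c"),
])
print(result)
```

In a real HBase deployment the prefix scan would be a range scan over row keys rather than a list slice, and the choice of which permutations to store trades storage space against query response speed, the two factors the abstract says the storage model weighs.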
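[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; TP393.09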
