Implementation and Application of a Spark-Based Semantic Reasoning Engine

Published: 2018-12-09 13:15

[Abstract]: In recent years, with the rapid growth of knowledge graphs, the scale of the associated Semantic Web data has also grown explosively. Performing semantic reasoning effectively on large-scale Semantic Web data is a difficult problem for researchers. Specifically, reasoning over such data involves a huge amount of computation and long running times, especially when complex rule logic is applied. Traditional single-machine semantic reasoning engines cannot cope with reasoning over large-scale knowledge graphs: they were not designed with scalability in mind and struggle to meet the reasoning needs of ever-growing semantic linked data. On the distributed side, existing semantic reasoning frameworks based on Hadoop MapReduce remain inefficient because they lack reasoning-specific optimizations for network communication and disk I/O. To address these problems, this thesis builds on the distributed in-memory computing platform Spark and studies the following. First, it designs a complete distributed reasoning engine architecture that is well modularized and has configurable reasoning rules. Next, it surveys existing single-machine and distributed semantic reasoning algorithms, implements the relevant algorithms in a distributed manner on Spark, and optimizes them according to Spark's principles and characteristics. The Spark-based reasoning engine is then compared experimentally with existing traditional distributed reasoning engines in terms of reasoning efficiency. The results show that the Spark-based semantic reasoning engine designed in this thesis is far more efficient than reasoning implementations represented by Hadoop MapReduce, while also being highly scalable. Finally, the system is applied to the Internet of Things domain, where it supports real-time, streaming semantic data processing and reasoning.
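
The abstract mentions a reasoning engine with configurable rules but does not give the rule set or the algorithm. As a rough illustration only, the sketch below shows one common way such an engine can be organized: rules are plain functions over a set of triples, and the engine applies them repeatedly until no new triples appear (forward chaining to a fixpoint). It is written in plain single-machine Python, not Spark, and is not the thesis's implementation. The two RDFS-style rules, the names `subclass_transitivity`, `type_inheritance`, and `infer`, and the sample facts are all assumptions made for this example.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Rule = Callable[[Set[Triple]], Set[Triple]]

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def subclass_transitivity(triples: Set[Triple]) -> Set[Triple]:
    """(a subClassOf b), (b subClassOf c) => (a subClassOf c)."""
    sub = [(s, o) for s, p, o in triples if p == SUBCLASS]
    return {(a, SUBCLASS, c) for a, b in sub for b2, c in sub if b == b2}


def type_inheritance(triples: Set[Triple]) -> Set[Triple]:
    """(x type a), (a subClassOf b) => (x type b)."""
    types = [(s, o) for s, p, o in triples if p == TYPE]
    sub = [(s, o) for s, p, o in triples if p == SUBCLASS]
    return {(x, TYPE, b) for x, a in types for a2, b in sub if a == a2}


def infer(triples: Iterable[Triple], rules: Iterable[Rule]) -> Set[Triple]:
    """Apply every rule until an iteration derives nothing new."""
    rules = list(rules)
    closure = set(triples)
    while True:
        derived: Set[Triple] = set()
        for rule in rules:
            derived |= rule(closure)
        new = derived - closure
        if not new:                    # fixpoint reached
            return closure
        closure |= new


# Hypothetical Internet of Things style facts, invented for this example.
facts = {
    ("ex:Sensor", SUBCLASS, "ex:Device"),
    ("ex:Device", SUBCLASS, "ex:Thing"),
    ("ex:s1", TYPE, "ex:Sensor"),
}
closure = infer(facts, [subclass_transitivity, type_inheritance])
for triple in sorted(closure - facts):
    print(triple)
```

The rule list passed to `infer` is what makes the rule set configurable in this sketch. In a distributed setting, each rule's pairwise matching would typically become a join over partitioned triple collections, and the fixpoint loop is where keeping intermediate results in memory, rather than writing them to disk between iterations, could pay off.
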
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.52
