
Research on Distributed Query Techniques for Large-Scale Knowledge Graphs

Published: 2019-06-28 16:46
[Abstract]: With the arrival of the big data era, the volume of collected data has reached the zettabyte (ZB) scale. To query data precisely, more and more search engines use knowledge graphs as their underlying data support. A knowledge graph is a relational network that describes real-world entities such as places, people, cities, and films, together with the relationships among them. With a knowledge graph, a search engine can mine the intrinsic connections between entities and find the information users need more accurately. At present, knowledge graph data is mainly harvested automatically from knowledge encyclopedias such as Wikipedia and contains a large amount of unverified information. As a result, knowledge graphs are both noisy and large, and these two characteristics make it hard for users to obtain satisfactory query results quickly. How to query knowledge graphs quickly and efficiently under these conditions is an urgent problem for both academia and industry.

Existing work usually models knowledge graph query as a subgraph matching problem and has made some progress, but several shortcomings remain. First, most existing query models require results to match the user's query exactly. Because knowledge graphs contain noisy data, these models miss results the user is interested in, which makes them poorly usable. Second, to speed up queries, existing algorithms generally rely on graph indexes, but knowledge graphs are so large that building such indexes incurs high time and space costs. Finally, the size of knowledge graphs means that queries have to be executed in a distributed manner, yet existing distributed graph data processing platforms are not optimized for the execution of knowledge graph queries and therefore run them inefficiently. New query models, algorithms, and computing platforms are needed to meet these challenges.

Targeting the two characteristics of knowledge graphs, noisy data and large scale, this work studies the knowledge graph query problem at three levels: the query model, the distributed query algorithm, and the optimization of distributed query execution, with the aim of providing fast and efficient distributed query techniques. First, it proposes a query model for knowledge graphs that uses fuzzy matching to mask noisy data and always guarantees that satisfactory query results are returned. Second, based on this query model, it designs an index-free distributed query algorithm that reduces query time through a new bounding technique and uses the computing power of the distributed environment to speed up queries, so that query requests are answered quickly. Third, on a distributed graph data processing platform, it optimizes the execution efficiency of distributed knowledge graph queries from two angles, job scheduling and data storage, reducing data I/O overhead and further shortening the overall query completion time. Building on this theoretical work, a prototype search engine for large-scale knowledge graphs is designed and implemented, and a query application over an academic literature knowledge graph is deployed to validate the effectiveness of the theoretical results.

In summary, this work addresses the two characteristics of knowledge graphs with fast and efficient distributed query techniques that let users obtain satisfactory query results quickly, offering a practical solution for next-generation search engines. As knowledge graphs become more widespread, the results are expected to find application in fields such as business, finance, and the life sciences, providing effective data query support for business decision-making, financial analysis, biopharmaceuticals, and similar applications, with significant social value.
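To make the ideas of fuzzy matching and bound-based pruning more concrete, the following is a minimal single-machine sketch in Python. It is not the model or algorithm of this work, whose similarity function, bound, and distributed execution are not described in the abstract. Everything in it is an assumption chosen for illustration: the toy graph `NODES` and `EDGES`, the use of `difflib.SequenceMatcher` as the label similarity, the `threshold` value, and the simple upper bound that credits every unmatched query variable with the maximum similarity of 1.0.

```python
import heapq
from difflib import SequenceMatcher

# Hypothetical toy knowledge graph: node id -> label, plus labelled edges.
NODES = {
    1: "Christopher Nolan",
    2: "Inception",
    3: "Interstellar",
    4: "Leonardo DiCaprio",
    5: "Cristopher Nolan",  # noisy entry with a misspelt label
    6: "Tenet",
}
EDGES = {
    (1, "directed", 2),
    (1, "directed", 3),
    (4, "acted_in", 2),
    (5, "directed", 6),
}


def sim(a, b):
    """Label similarity in [0, 1]; an illustrative choice, not the thesis's."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def top_k_matches(query_nodes, query_edges, k=2, threshold=0.6):
    """Return the k best fuzzy matches of a small query pattern.

    query_nodes: {variable: keyword or None}; None matches any node.
    query_edges: [(variable, predicate, variable)], matched exactly.
    """
    order = list(query_nodes)
    best = []      # min-heap of (score, tiebreak, assignment)
    counter = 0    # unique tiebreak so dicts are never compared

    def consistent(assign):
        # Every query edge whose endpoints are both bound must exist.
        return all(
            (assign[s], p, assign[o]) in EDGES
            for s, p, o in query_edges
            if s in assign and o in assign
        )

    def search(i, assign, score):
        nonlocal counter
        # Upper bound: each unbound variable adds at most 1.0.
        bound = score + (len(order) - i)
        if len(best) == k and bound <= best[0][0]:
            return  # cannot beat the current k-th best match
        if i == len(order):
            counter += 1
            item = (score, counter, dict(assign))
            if len(best) < k:
                heapq.heappush(best, item)
            else:
                heapq.heappushpop(best, item)
            return
        var, keyword = order[i], query_nodes[order[i]]
        for node_id, label in NODES.items():
            if node_id in assign.values():
                continue
            s = 1.0 if keyword is None else sim(keyword, label)
            if s < threshold:
                continue
            assign[var] = node_id
            if consistent(assign):
                search(i + 1, assign, score + s)
            del assign[var]

    search(0, {}, 0.0)
    return [(s, a) for s, _, a in sorted(best, reverse=True)]


# "Which film did Christopher Nolan direct?", asked with a film keyword.
result = top_k_matches(
    {"d": "Christopher Nolan", "f": "Tenet"},
    [("d", "directed", "f")],
    k=1,
)
```

In this toy graph the only node connected to "Tenet" carries the misspelt label "Cristopher Nolan", so an exact-label match would return nothing, whereas the fuzzy match still binds `d` to node 5 and `f` to node 6. No index is built in advance; the search relies on the bound alone to skip candidates that cannot enter the current top k.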

Article ID: 2507457


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2507457.html

