Semantic Search over Large-Scale RDF Data

Published: 2018-01-30 04:27

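Keywords: semantic search; hybrid query; graph data indexing; query optimization; entity matching; query translation; ranking  Source: Shanghai Jiao Tong University, 2013 doctoral dissertation  Document type: dissertation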

[Abstract]: The Semantic Web gives information explicit structure and semantics, so that machines can not only display that information but also understand, process, and integrate it. In recent years, as projects such as Linked Open Data and DBpedia have gotten fully under way, the number of Semantic Web data sources has surged, and large volumes of graph-structured semantic data using RDF as the data model have been published. The Internet is changing from a Web of Documents, which contains only web pages and the hyperlinks between them, into a Web of Data, which contains a large number of descriptions of entities and the rich relationships among them. Against this background, the major search engine companies, led by Google, have built knowledge graphs on this foundation to improve search quality, which marked the beginning of semantic search.
Unlike traditional document retrieval, semantic search must handle structured semantic data of finer granularity, and it therefore faces greater, unprecedented challenges. Mature storage and indexing techniques designed for unstructured Web documents no longer apply to RDF data. Existing ranking algorithms cannot be applied directly to semantic search oriented toward entities and relations. SPARQL query support and data integration across heterogeneous semantic data sources are entirely new problems. In addition, supporting the keyword queries that users are familiar with is essential to the adoption of semantic search.
This dissertation aims to address, comprehensively and systematically, the challenges facing semantic search over large-scale RDF data: supporting the storage and indexing of large-scale graph data, supporting graph-structured queries that contain keywords, supporting entity-centric structured ranking, supporting the fusion of heterogeneous data from multiple data sources, and supporting friendly user interaction. The main content and contributions of each chapter are as follows.
Chapter 1 is the introduction. It presents the research background, summarizes the state of research on semantic search in China and abroad, and describes in detail the main challenges facing semantic search over large-scale RDF data.
Chapter 2 applies information retrieval methods to searching the Web of Data for the first time. It uses and extends the inverted index to support efficient processing of single-variable, tree-shaped hybrid queries. On this basis, I propose a relation-based ranking algorithm to return relevant entities, use faceted browsing to let users construct hybrid queries interactively, and use a block-based index to support incremental index updates.
Chapter 3 extends the structured query capability of Chapter 2 and proposes an efficient RDF query engine for executing more general SPARQL queries. In addition, I collect specific RDF statistics to estimate the execution cost of query plans, and I design a new query optimization algorithm that determines the optimal join order, converting a SPARQL query graph into an optimal query plan.
Chapter 4 discusses efficient query processing based on RDF graph patterns. It introduces two pattern selection strategies: one selects frequent RDF subgraphs by heuristic rules, and the other uses the query history to select the subgraph structures that users prefer. Building on the previous two chapters, I further propose an efficient graph-pattern-based index, represent query plans as pattern trees, and process a SPARQL query by converting it into a sub-pattern cover problem.
Chapter 5 proposes a two-stage integrated solution to the entity matching problem in semantic search over large-scale RDF graph data. Blocking is used to filter candidate entity pairs quickly, which addresses scalability. The local structural features of entities are then used to cluster within each block and obtain the final matching results. This work is also the first attempt to use the existing sameAs triples in Linked Open Data for an extensive evaluation of entity matching quality at large scale.
Chapter 6 studies a novel and user-friendly interaction mode for keyword search, namely how to perform efficient keyword query translation over large-scale graph data (RDF data in particular). I propose a novel top-k subgraph search algorithm that translates a keyword query into structured queries rather than computing the query results directly. I also use summarization techniques to generate an aggregated graph that contains only graph pattern information, which speeds up query translation.
Chapter 7 presents a search infrastructure for the Web of Data that supports pay-as-you-go data integration. It extends query translation to heterogeneous Web data sources, that is, it translates user keywords into a semantic structured query that spans multiple data sources. In addition, I describe in detail the techniques for distributed query processing on the Web of Data, in particular the mapping join. The mapping join uses the large-scale entity matching method of Chapter 5 to precompute data-level mappings, and it efficiently merges the results obtained from heterogeneous data sources.
Chapter 8 extends the semantic search scenario to a hybrid Web environment that contains graph-structured data, web pages, and the corresponding semantic annotations at the same time. Information retrieval and database technologies are integrated to build a database that scales to large numbers of documents, graph-structured data, and semantic annotations. In addition, I propose a novel data structure for representing the (intermediate) results returned by hybrid search, and I design a series of efficient algorithms for hybrid query processing.
Chapter 9 summarizes the main work and results of this dissertation and gives an outlook on further research in semantic search.
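As an illustration of the two-stage entity matching idea summarized for Chapter 5 (blocking first, then clustering inside each block), the following is a minimal Python sketch and not the dissertation's actual algorithm. Everything specific in it is an assumption made for the example: label tokens as blocking keys, Jaccard similarity over neighbor sets as the "local structure" measure, union-find as the clustering step, the 0.5 threshold, and the hypothetical entity identifiers.

```python
from collections import defaultdict
from itertools import combinations


def block_by_token(entities):
    """Stage 1 (assumed blocking key): group entity ids sharing a label token."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for token in description["label"].lower().split():
            blocks[token].add(entity_id)
    return blocks


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_entities(entities, threshold=0.5):
    """Stage 2 (assumed measure): cluster entities inside each block by
    comparing their neighbor sets, merging matches with union-find."""
    parent = {entity_id: entity_id for entity_id in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only pairs that share a block are compared, never all pairs.
    for members in block_by_token(entities).values():
        for left, right in combinations(sorted(members), 2):
            similarity = jaccard(entities[left]["neighbors"],
                                 entities[right]["neighbors"])
            if similarity >= threshold:
                parent[find(left)] = find(right)

    clusters = defaultdict(set)
    for entity_id in entities:
        clusters[find(entity_id)].add(entity_id)
    # Report only clusters that actually match two or more entities.
    return [cluster for cluster in clusters.values() if len(cluster) > 1]


# Hypothetical entity descriptions from two sources, "a" and "b".
entities = {
    "a:Shanghai": {"label": "Shanghai",
                   "neighbors": {"China", "city", "Huangpu"}},
    "b:Shanghai": {"label": "Shanghai",
                   "neighbors": {"China", "city"}},
    "a:Shanghai_Tower": {"label": "Shanghai Tower",
                         "neighbors": {"skyscraper", "Lujiazui"}},
}

print(match_entities(entities))
```

In this toy input all three entities fall into the "shanghai" block, but only the two city descriptions share enough neighbors to be merged. The point of the blocking stage is that pairwise comparison is confined to each block, which is what keeps the approach tractable as the number of entities grows.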

[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: TP391.1
