Research on Query Planning Strategies for the Deep Web

Published: 2018-01-21 08:18

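Keywords: Web databases, query capability, executable query planning. Source: Harbin Engineering University, 2012 master's thesis. Document type: degree thesis.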

[Abstract]: Online data sources (also called Web databases) are increasingly common. They hide their data behind query forms, forming what is known as the deep Web. On the surface Web, HTML pages are static and the data is stored in the documents themselves; in the deep Web, the data is stored in back-end databases, and a dynamic HTML page is generated only after a user submits a query through a form. According to statistics from BrightPlanet, the deep Web contains 500 times as much information as the surface Web and continues to grow rapidly every year, so studying it is both necessary and significant. Web databases are large-scale, autonomous, heterogeneous, and dynamic, and different data sources have different, limited query capabilities. These characteristics make query processing in deep Web data integration more challenging than query processing in a traditional distributed environment. To address the autonomy and heterogeneity of data sources, this thesis proposes a method for describing data sources. To measure the size of the attribute vocabulary in each domain, it reports a survey: using search engines (for example, Google and Bing) and Web directories (for example, invisibleweb.com), 200 data sources were collected across four domains (movies, book sales, car sales, and music), 50 per domain. The survey shows that as the number of data sources grows, their combined vocabulary converges to a relatively small size. Motivated by this result, an inverted index is built for each attribute term. The thesis also proposes a modular approach to generating executable query plans for a target query, in which five modules work together: query expansion, preprocessing, query rewriting, finding relevant data sources, and generation. It further designs an algorithm that uses the inverted index to generate logical plans efficiently, and an algorithm that finds an executable order for a logical plan.
Because data sources have access restrictions, sources that do not appear in the logical plan may supply useful bound attributes and so help produce an executable query plan. The thesis identifies the conditions under which such off-query accesses are unnecessary, so that an executable query plan can be generated from the sources in the logical plan alone. It also identifies the conditions under which off-query accesses are necessary, and for that case proposes an algorithm that finds the data sources relevant to the logical plan. Finally, experiments show that the proposed algorithms have good efficiency, accuracy, and scalability.
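
The two ideas in the abstract, an inverted index over attribute terms and an executable order under access restrictions, can be illustrated with a minimal sketch. This is not the thesis's algorithm. The source names, the `attrs`/`required` description format, and the greedy ordering are all assumptions made for illustration. Each hypothetical source lists the attributes it exposes and the attributes that must be bound on its query form before it can be queried.

```python
from collections import defaultdict

# Hypothetical source descriptions (not taken from the thesis).
# "attrs": attributes the source exposes; "required": attributes that
# must already be bound before the source's form can be submitted.
SOURCES = {
    "book_store": {"attrs": {"title", "author", "isbn", "price"}, "required": {"isbn"}},
    "catalog": {"attrs": {"title", "author", "isbn"}, "required": {"title"}},
    "reviews": {"attrs": {"isbn", "rating"}, "required": {"isbn"}},
}


def build_inverted_index(sources):
    """Map each attribute term to the set of sources that expose it."""
    index = defaultdict(set)
    for name, desc in sources.items():
        for attr in desc["attrs"]:
            index[attr].add(name)
    return index


def executable_order(plan, sources, bound):
    """Greedily order the sources in `plan` so that every source's required
    attributes are bound before it is queried.

    Returns the order as a list, or None if no executable order exists
    using only the sources in `plan`.
    """
    bound = set(bound)
    remaining = list(plan)
    order = []
    while remaining:
        # Pick the first source whose access restriction is already satisfied.
        ready = next((s for s in remaining if sources[s]["required"] <= bound), None)
        if ready is None:
            return None
        order.append(ready)
        remaining.remove(ready)
        # Querying a source binds every attribute it returns.
        bound |= sources[ready]["attrs"]
    return order


index = build_inverted_index(SOURCES)
# Only `title` is bound by the user's query, so the plan below cannot run as is.
blocked = executable_order(["book_store", "reviews"], SOURCES, {"title"})
# Adding the off-query source `catalog` supplies `isbn` and unblocks the plan.
unblocked = executable_order(["reviews", "catalog", "book_store"], SOURCES, {"title"})
```

In this toy setup, `blocked` is `None` because neither source in the plan accepts a query by title, while `unblocked` places `catalog` first. That mirrors the abstract's point that a source outside the logical plan can provide the bindings needed to make the plan executable.
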
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP311.13
