Overlapping Deep Web Data Source Selection Based on Stratified Sampling

Published: 2018-11-27 11:55

[Abstract]: Many Deep Web query applications on the Web, such as multimedia data search and group-buying information aggregation, must query a large number of data sources to obtain enough data. The success of such applications depends on the efficiency and effectiveness of querying multiple data sources. Current research focuses on the relevance between a query and a data source while ignoring the overlap among data sources, so identical results are retrieved repeatedly from different data sources, which increases both query cost and the workload of the data sources. To improve the efficiency of Deep Web queries, a tuple-level stratified sampling method is proposed to estimate and exploit query statistics on the data sources, in order to select data sources with high relevance and low overlap. The method has two stages. In the offline stage, each data source is sampled by tuple-level stratified sampling to obtain sample data. In the online stage, the coverage and overlap of a query on the data sources are estimated iteratively from the sample data, and a heuristic strategy is used to discover low-overlap data sources efficiently. Experimental results show that the method can significantly improve the accuracy and efficiency of overlapping data source selection.
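The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are not given here. It assumes proportional allocation within each stratum and a simple greedy heuristic that repeatedly picks the source whose sample adds the most not-yet-covered matching tuples. The names `stratified_sample`, `select_sources`, `stratum_of`, `matches`, and `budget` are hypothetical, and tuples are assumed to be hashable so that identical tuples from different sources compare equal.

```python
import random
from collections import defaultdict


def stratified_sample(tuples, stratum_of, rate, seed=0):
    """Offline stage (sketch): sample every stratum at the same rate.

    Assumption: proportional allocation, with at least one tuple per stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in tuples:
        strata[stratum_of(t)].append(t)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * rate))
        sample.extend(rng.sample(members, k))
    return sample


def select_sources(samples, matches, budget):
    """Online stage (sketch): greedy selection by estimated new coverage.

    samples: dict mapping source name -> list of sampled tuples.
    matches: predicate telling whether a tuple answers the query.
    budget:  maximum number of sources to select.
    """
    # Matching sample tuples per source stand in for the query's coverage.
    remaining = {
        name: {t for t in sample if matches(t)}
        for name, sample in samples.items()
    }
    covered, chosen = set(), []
    while remaining and len(chosen) < budget:
        # Gain = matching tuples not already covered, so overlap is penalized.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # every remaining source only repeats known results
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered
```

In this sketch a source that is highly relevant but largely duplicates an already selected source gets a small gain and is skipped, which mirrors the abstract's goal of choosing high-relevance, low-overlap sources.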
[Affiliations]: School of Computer Science, Wuhan University; State Key Laboratory of Software Engineering (Wuhan University)
[Funding]: National Natural Science Foundation of China (61232002, 61202035); Hubei Province Science and Technology Support Program (2015BAA127)
[CLC Number]: TP311
