An Incremental Optimization Method for Periodic Queries in Data Warehouses
Published: 2018-10-20 09:52
[Abstract]: Big data holds enormous value, and analytical queries are an important means of extracting it. To keep up with changes in the analysis results, such queries must be repeated periodically, which inevitably introduces repeated analysis of old data. Existing incremental analysis techniques, whose core idea is to reuse intermediate results computed from historical data and to eliminate redundant computation, suffer from poor user transparency and from insufficiently intelligent selection of where historical results are stored, so their benefit for periodic incremental queries is limited. To balance user transparency against optimization benefit, an incremental optimization method guided by semantic rules is designed. The method extends the incremental description syntax, uses the operation semantics and output semantics of query operators to guide the choice of where historical data is stored and merged, and then adjusts that choice according to a cost model and the split points of the physical query tasks, producing optimized physical query tasks that can be scheduled periodically on a distributed computing framework (such as MapReduce). A prototype of the method, HiveInc, is implemented on top of Apache Hive. Experimental results show that, on a TPC-H test set extended with incremental syntax descriptions, HiveInc achieves an average speedup of 2.93x and a maximum speedup of 5.78x over the unoptimized baseline; compared with the classic optimization techniques IncMR and DryadInc, it achieves speedups of 1.69x and 1.61x, respectively.
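The core idea named in the abstract, reusing intermediate results from historical data and merging them with the results for newly arrived data, can be illustrated with a minimal sketch. This is not the HiveInc implementation: the function names (`partial_aggregate`, `merge_states`, `finalize`), the in-memory dictionaries, and the sample rows are all hypothetical, and the query is assumed to be a simple per-key average. The sketch also hints at why the storage position matters: the state is saved before the final division, because per-key (sum, count) pairs can be merged while finished averages cannot.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Compute a mergeable per-key (sum, count) state for a batch of (key, value) rows."""
    state = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        state[key][0] += value
        state[key][1] += 1
    return {key: (total, count) for key, (total, count) in state.items()}


def merge_states(old, new):
    """Merge the saved historical state with the state computed from the new data only."""
    merged = dict(old)
    for key, (total, count) in new.items():
        old_total, old_count = merged.get(key, (0.0, 0))
        merged[key] = (old_total + total, old_count + count)
    return merged


def finalize(state):
    """Turn the mergeable state into the query result (per-key average)."""
    return {key: total / count for key, (total, count) in state.items()}


# Hypothetical data: rows seen in earlier periods and rows arriving in this period.
history = [("a", 10.0), ("b", 4.0), ("a", 20.0)]
delta = [("a", 30.0), ("c", 8.0)]

# Earlier periods: the intermediate state is computed once and saved.
saved_state = partial_aggregate(history)

# Current period: only the delta is scanned, then merged with the saved state.
incremental = finalize(merge_states(saved_state, partial_aggregate(delta)))

# Baseline: recompute everything from scratch over old and new data.
full = finalize(partial_aggregate(history + delta))

assert incremental == full
```

In this toy setting the incremental run touches only the two new rows, while the baseline rescans all five; the saving grows with the ratio of historical to new data.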
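[Affiliation]: State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences); University of Chinese Academy of Sciences
[Funding]: National High Technology Research and Development Program of China (863 Program) (2015AA011505); National Natural Science Foundation of China (61303053, 61402445, 61402303, 61521092)
[CLC Number]: TP311.13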
Article ID: 2282789
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2282789.html