Design and Implementation of an HBase-Based Data Management System
Published: 2018-10-16 21:34
[Abstract]: With the rapid growth of the Internet, applications generate ever larger volumes of data, and the distributed database HBase has been widely adopted for managing massive data sets. Many enterprises want to migrate data held in relational databases into HBase and manage it there, so a data management system built on HBase is of practical importance. Starting from an analysis of the design goals for such a system, this thesis presents an overall design with two main functions: migrating schemas and data from a relational database into HBase, and managing the data in HBase through SQL statements. The migration function stores each relational table's column information, index information, and primary/foreign key information in an HBase metadata table. The table data migration job is split into many small tasks that are distributed as evenly as possible across the machines in the cluster; during migration, data is stored redundantly according to the primary/foreign key information, and index tables are created in HBase and populated according to the index information. For SQL-based management of HBase data, the work focuses on optimizing multi-table join queries. A multi-table join query is decomposed, according to the characteristics of HBase, into several sub-queries that are themselves multi-table joins, and HBase coprocessors execute these sub-queries concurrently. Within each sub-query, the join order of the tables is optimized according to the characteristics of the join conditions, and the redundant data and index data produced during migration are used to speed up the join. The intermediate data of each sub-query is stored in hash tables and multi-way trees to reduce memory overhead. The results returned by the sub-queries are merged on the client. Experimental tests show that the system migrates table schemas and data efficiently, manages the migrated data correctly, and performs better than Hive on multi-table join queries.
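The abstract does not spell out the join algorithm, so the following Python sketch only illustrates the general pattern it describes: split a join into sub-joins over partitions, run the sub-joins concurrently, and merge the partial results on the client. It makes several assumptions that are not from the thesis: `ThreadPoolExecutor` stands in for HBase coprocessors, contiguous list slices stand in for HBase regions, all function names are hypothetical, and the right-hand table is passed whole to every sub-join, which is a simplification rather than the author's method.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def hash_join(left_rows, right_rows, left_key, right_key):
    """Join two lists of dict rows using a hash table built on the right side."""
    table = defaultdict(list)
    for row in right_rows:
        table[row[right_key]].append(row)
    joined = []
    for row in left_rows:
        # .get avoids creating empty buckets for keys that have no match.
        for match in table.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


def split_into_partitions(rows, num_partitions):
    """Split rows into contiguous, near-equal slices (a stand-in for regions)."""
    size, extra = divmod(len(rows), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def distributed_join(left_rows, right_rows, left_key, right_key, num_partitions=3):
    """Run one sub-join per partition concurrently, then merge on the 'client'."""
    partitions = split_into_partitions(left_rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = pool.map(
            lambda part: hash_join(part, right_rows, left_key, right_key),
            partitions,
        )
        # Client-side merge: concatenate the partial results in partition order.
        return [row for partial in partials for row in partial]


# Hypothetical sample data: order 4 has no matching customer and is dropped.
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 3, "customer_id": 10},
    {"order_id": 4, "customer_id": 12},
]
customers = [{"id": 10, "name": "Ann"}, {"id": 11, "name": "Bo"}]

result = distributed_join(orders, customers, "customer_id", "id")
```

In this sketch each sub-join touches only its own slice of the left table, so the per-worker hash-probe work shrinks as partitions are added; a real system would also need to decide how the right-hand rows reach each worker, which is where the redundant and index data mentioned in the abstract would come in.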
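[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13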
Article ID: 2275687
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2275687.html