Research on Data Compression Technology Based on HBase

Published: 2018-07-24 08:30

[Abstract]: With the development of big data technology and the rapid adoption of big data platforms such as Hadoop, the volume of data generated in everyday life is growing explosively, data types are becoming more complex, and storage methods are becoming more diverse. Traditional row-oriented storage cannot store big data at low cost. At the same time, because data is accessed with different frequencies, data at different access levels places new requirements on how it is stored. To address these issues, this thesis studies HBase-based compressed storage in large-scale data environments, using the HBase database on a big data platform. The main contributions are as follows.

First, a data classification method based on access frequency is proposed: the access frequency of each database file is derived from the number of times it is accessed over a period of time, and each data file is classified as hot or cold and assigned a specific access level according to its access frequency and the relevant thresholds. Building on this, a compression strategy selection method based on data access level is proposed. A sampling method for determining data samples is defined; to address the weakness that the prior knowledge used by the original compression strategy selection method is not necessarily reliable, an evaluation layer is added to adjust that prior knowledge in a timely manner; and an HBase data compression strategy selection method is designed on the basis of the adjacent-reference-region and statistics-based column selection methods, in order to optimize storage cost. Simulation experiments show that the proposed method not only stores big data effectively but also improves data access performance.

Second, from the perspective of data migration, a data migration method based on file value is proposed. The value of each data block file is computed from factors such as access frequency, and this file value determines the destination device of the migration. The data migration technique is also improved: a data buffer and a double-buffer queue resolve the mismatch between inbound and outbound migration rates, which improves migration efficiency, reduces memory and time consumption, and ultimately optimizes data storage on the big data platform.

Finally, based on the above methods and theory, a prototype system for compressed data storage is built and an e-commerce application demonstration is given. The system's implementation follows the process of requirements analysis, high-level design, detailed design, and implementation. It provides functional modules such as compressed storage management and data migration, verifies the feasibility of the proposed algorithms, and shows how the theoretical results on HBase-based compression perform in dynamic scenarios.
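
The access-frequency classification summarized in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual algorithm: the abstract does not give the frequency formula, the threshold values, or the number of access levels, so the per-hour frequency, the `THRESHOLDS` list, the `HOT_LEVEL` cutoff, and the file names below are all assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileAccess:
    """Access count for one data file over an observation window."""
    name: str
    access_count: int


def access_frequency(access_count: int, window_hours: float) -> float:
    """Accesses per hour over the observation window (assumed definition)."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return access_count / window_hours


def classify(frequency: float, thresholds: list[float]) -> int:
    """Return an access level: the number of thresholds the frequency meets.

    Level 0 is the coldest; len(thresholds) is the hottest.
    """
    return sum(1 for t in thresholds if frequency >= t)


# Hypothetical thresholds in accesses per hour; not taken from the thesis.
THRESHOLDS = [1.0, 10.0]
# Assumption: any file at level 1 or above is treated as "hot".
HOT_LEVEL = 1

# Hypothetical files and access counts for a 24-hour window.
files = [
    FileAccess("orders_2016", 480),
    FileAccess("orders_2015", 36),
    FileAccess("orders_2012", 2),
]

for f in files:
    freq = access_frequency(f.access_count, window_hours=24)
    level = classify(freq, THRESHOLDS)
    label = "hot" if level >= HOT_LEVEL else "cold"
    print(f"{f.name}: {freq:.2f}/h, level {level}, {label}")
```

In a scheme like this, the resulting access level would then feed the compression strategy selection, for example by favoring faster codecs for hot files and higher-ratio codecs for cold ones. The abstract does not state the mapping, so none is shown here.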
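[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13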
