Design and Implementation of a Real-Time Data Reporting System for Large Data Volumes

Published: 2018-04-21 13:24

Topics: massive data + big data; Source: Beijing Jiaotong University, 2016 master's thesis

[Abstract]: In the report query business of an intelligent catering system, merchant users have a strong need for summaries of their business data, and a reporting system can readily meet that need. Generating report data from existing data involves a large number of ad hoc queries across multiple database tables, most of which include table joins. When the total data volume is below the ten-million-record level, the response time of the traditional approach (querying the database directly) can be optimized to under ten seconds. Once the queried data reaches tens of millions, hundreds of millions, or even a billion records, however, no amount of optimization or index redesign lets the traditional approach meet the requirement for fast responses under many concurrent queries, and the queries themselves put heavy pressure on the database. The company where the author interned currently uses offline computation: data is imported into a data warehouse (Hive), computed offline, and the result set is then queried. The drawback is that ad hoc queries are not possible.
This thesis presents another approach, which solves these problems by introducing a distributed index layer and which is used in many big data ad hoc query scenarios. In the data synchronization module, many tables from the relational database (MySQL) are merged into a single wide table to preserve data integrity, and the fast query capability of a search engine (Solr) is used to improve query efficiency. In a wide-table query scenario with 50 million records and 20 concurrent requests per second, results are returned within 2 seconds and every query succeeds. Neither the traditional approach (querying the database directly) nor offline computation can achieve this query speed and data freshness.
The thesis describes in detail the design and implementation of the full data synchronization module, the incremental data synchronization module, and the report business module. Only when the full and incremental synchronization modules work together can the data in the distributed index remain both accurate and up to date; the report business module then queries that data according to business requirements and returns real-time report data to the user. In the full data synchronization module, Java multithreading combined with intelligent scheduling of the synchronization threads greatly increases synchronization speed. The real-time (incremental) synchronization module is built on Alibaba's MySQL data synchronization component and on message middleware, and it ensures that incremental data is synchronized to the distributed index in real time. The author independently completed the child-table import, Hive binding, Hive wide-table synthesis, and index file generation submodules of the full data synchronization module; the incremental message publisher and incremental message consumer submodules of the incremental data synchronization module; and the membership submodule of the report business module. The project has passed testing and has been deployed to production, where it runs normally and provides users with real-time, accurate report data.
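The interplay of wide-table merging, multithreaded full synchronization, and incremental updates can be illustrated with a minimal sketch. It is not the thesis's implementation: in-memory maps stand in for the MySQL tables and for the Solr index, a fixed-size thread pool stands in for the scheduled synchronization threads, and every name (`WideTableSyncSketch`, `toWideRow`, `fullSync`, `applyChange`, the `order_` and `member_` prefixes) is a hypothetical chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: maps stand in for MySQL tables and for the search index. */
final class WideTableSyncSketch {

    /** Build a row from alternating column names and values (illustrative helper). */
    static Map<String, Object> row(Object... kv) {
        Map<String, Object> r = new HashMap<>();
        for (int i = 0; i < kv.length; i += 2) {
            r.put((String) kv[i], kv[i + 1]);
        }
        return r;
    }

    /** Join one order row with its member row into a single flat "wide" document. */
    static Map<String, Object> toWideRow(Map<String, Object> order,
                                         Map<Integer, Map<String, Object>> membersById) {
        Map<String, Object> wide = new TreeMap<>();
        order.forEach((k, v) -> wide.put("order_" + k, v));
        Object memberId = order.get("member_id");
        Map<String, Object> member = memberId == null ? null : membersById.get(memberId);
        if (member != null) {
            member.forEach((k, v) -> wide.put("member_" + k, v));
        }
        return wide;
    }

    /** Full sync: split the order table into batches and index each batch on a worker thread. */
    static Map<Integer, Map<String, Object>> fullSync(List<Map<String, Object>> orders,
            Map<Integer, Map<String, Object>> membersById, int threads, int batchSize) {
        Map<Integer, Map<String, Object>> index = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int start = 0; start < orders.size(); start += batchSize) {
                List<Map<String, Object>> batch =
                        orders.subList(start, Math.min(start + batchSize, orders.size()));
                pending.add(pool.submit(() -> {
                    for (Map<String, Object> order : batch) {
                        index.put((Integer) order.get("id"), toWideRow(order, membersById));
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the batch and surface any worker failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("full sync interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("full sync batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return index;
    }

    /** Incremental sync: apply one change event, as a message consumer might. */
    static void applyChange(Map<Integer, Map<String, Object>> index, String type,
            Map<String, Object> order, Map<Integer, Map<String, Object>> membersById) {
        Integer id = (Integer) order.get("id");
        if ("DELETE".equals(type)) {
            index.remove(id);
        } else { // INSERT and UPDATE both overwrite the document with the same id
            index.put(id, toWideRow(order, membersById));
        }
    }
}
```

In this sketch, `fullSync` builds the whole index once, and `applyChange` is then called for each change event so the index tracks the source tables. Keying documents by primary key makes inserts and updates idempotent overwrites, which is one plausible way to keep a full load and a stream of increments consistent; whether the thesis does the same is not stated in the abstract.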
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP311.52

[Similar literature]