Design and Implementation of a High-Concurrency Heterogeneous Data Preprocessing System

Published: 2018-03-20 04:17

Topic: big data　Focus: heterogeneous data　Source: Beijing Jiaotong University, master's thesis, 2017　Document type: degree thesis

[Abstract]: We live in an era of data. As big data technology develops, more and more industries need these new techniques to re-mine the value of the data they have accumulated, so that the data can play a greater role and better serve users and enterprises. Most of this data, however, is dirty: incomplete and inconsistent. It either cannot be mined directly or yields unsatisfactory results, so it must be preprocessed first. The author took part in the development of a patent retrieval and analysis platform and was responsible for designing and implementing the platform's underlying heterogeneous data preprocessing system. This thesis describes the system in detail, covering the project background and significance, the state of the art in China and abroad, requirements analysis, technical architecture, functional structure, detailed data design, detailed system design and implementation, and testing.

The system provides patent data preprocessing and storage services for the platform. Patent data consists of a huge number of small, scattered files, comes in many formats and languages, and originates from inconsistent sources, yet it has to be loaded into the databases within a short time. The thesis therefore introduces the concept of index data, which encapsulates the patent data, and designs and implements multi-task parallel loading of patent data on top of the Quartz framework. Five different data stores meet the storage requirements: the retrieval database Hybase stores the data to be searched; the NoSQL database MongoDB stores semi-structured data for front-end display; a distributed file system stores massive unstructured data; the cache database Redis stores business data that needs caching; and the relational database MySQL stores the control and operations data generated as data flows through the system. All five are deployed in a distributed fashion, and master-slave replication, dual-machine hot standby, and ZooKeeper are used to keep them highly available.

The system has five modules: data loading and updating, data quality inspection, data repair, data monitoring, and a task orchestration tool. The data loading and updating module is the most important. During loading, each index data file is treated as one batch, and data is loaded batch by batch. Because index data files encapsulate the patent data files, batches can be processed by multiple tasks in parallel. Loading is also divided into several stages, each of which can validate the data and roll it back. The data quality inspection and data monitoring modules work together to detect erroneous data promptly. The data repair module repairs the data. The task orchestration tool module automatically copies index data files.

The system has been delivered and went live on schedule, and all accumulated patent data has been loaded and made available to users. The system is currently running well. To make the product more competitive, the company is actively promoting it, and the author expects more users to adopt it.
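The batch-loading idea in the abstract (one index data file per batch, batches processed in parallel, validation and rollback per batch) can be illustrated with a minimal sketch. It is not the thesis's implementation: the thesis builds on Quartz, whereas this sketch swaps in a plain `ExecutorService` so that it stays self-contained, and it collapses the multiple loading stages into a single validate-and-write pass. All names (`BatchLoader`, `IndexBatch`, `loadBatch`, `loadAll`, `storedCount`) are hypothetical, the validation rule (non-empty record) is an assumption, and an in-memory map stands in for the real data stores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: one index file is one batch; a failed batch is rolled back. */
class BatchLoader {

    /** One index data file and the records it wraps (assumed shape). */
    record IndexBatch(String indexFile, List<String> records) {}

    /** Names of the batches that were committed and of those rolled back. */
    record Result(List<String> loaded, List<String> rolledBack) {}

    // In-memory stand-in for the real stores: index file name -> stored records.
    private final Map<String, List<String>> store = new ConcurrentHashMap<>();

    /** Writes records one by one; an invalid record undoes the whole batch. */
    boolean loadBatch(IndexBatch batch) {
        List<String> written = new ArrayList<>();
        store.put(batch.indexFile(), written);      // partial writes live here
        try {
            for (String raw : batch.records()) {
                String rec = raw.trim();
                if (rec.isEmpty()) {                // assumed validation rule
                    throw new IllegalArgumentException("empty record");
                }
                written.add(rec);
            }
            return true;
        } catch (IllegalArgumentException e) {
            store.remove(batch.indexFile());        // roll back this batch only
            return false;
        }
    }

    /** Runs every batch as its own task on a fixed-size worker pool. */
    Result loadAll(List<IndexBatch> batches, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (IndexBatch b : batches) {
                Callable<Boolean> task = () -> loadBatch(b);
                futures.add(pool.submit(task));
            }
            List<String> loaded = new ArrayList<>();
            List<String> rolledBack = new ArrayList<>();
            for (int i = 0; i < batches.size(); i++) {
                String name = batches.get(i).indexFile();
                if (futures.get(i).get()) {
                    loaded.add(name);
                } else {
                    rolledBack.add(name);
                }
            }
            return new Result(loaded, rolledBack);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batch loading failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Number of records currently stored for a batch (0 if absent or rolled back). */
    int storedCount(String indexFile) {
        List<String> recs = store.get(indexFile);
        return recs == null ? 0 : recs.size();
    }
}
```

Calling `new BatchLoader().loadAll(batches, 4)` returns the batches that committed and those that were rolled back. Because each batch owns its own key in the store, a failure in one batch does not touch records written by another, which is the property that makes per-batch parallelism safe in this sketch.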
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13
