Design and Implementation of a Unified Storage Platform for Unstructured Data

Published: 2018-01-15 00:27

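Keywords: Design and Implementation of a Unified Storage Platform for Unstructured Data  Source: Zhejiang University, 2013 master's thesis  Document type: Degree thesis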

[Abstract]: Data on the Internet is growing rapidly, in variety as well as in volume. From traditional text to today's web documents, images, audio, and video, the bulk of Internet data has gradually shifted from structured to unstructured data, and this growing and increasingly varied unstructured data poses new challenges for the storage and management of Internet data.
This thesis first surveys existing solutions to the storage of various kinds of massive unstructured data, analyzes the problems of each storage system, and from these derives the key problems that unified storage of unstructured data must solve.
Then, to address the storage of unstructured data that is massive, heterogeneous, and interrelated, it proposes D-Ocean Repository, a unified storage management platform for unstructured data. By addressing key problems including metadata management, a unified data interface, heterogeneous storage, and high availability and consistency of data, the platform integrates storage facilities such as HDFS, HBase, MySQL, and XMLDB, and uses a selection mechanism among the heterogeneous storage facilities to store the different kinds of data together efficiently.
In addition, on top of the unified storage platform, the thesis designs and implements a batch processing framework for unstructured data. The framework exploits unified storage to process the various kinds of unstructured data in a uniform way, and implements efficient parallel processing based on the MapReduce architecture, so that computing resources and data storage are closely integrated.
Finally, the thesis implements an Internet application that uses the D-Ocean system for back-end data management, an Internet cross-media news retrieval system, to demonstrate the practicality of the unified storage platform and to support the future development of Internet applications that handle more kinds of unstructured data.

[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333
