Research on Compressed Storage of Provenance Data

Published: 2018-09-04 05:37

[Abstract]: As information technology has developed, people have come to care not only about data itself but also about where the data came from and how it evolved. This historical information is called the provenance of the data. Provenance is widely used in scientific research, where data quality is critical, and many systems generate and collect it in fields including physics and astronomy, chemistry, biology, and oceanography and meteorology. Applications in data reconstruction, debugging and tracing, security, and search have also begun to appear. In many existing provenance systems, however, the provenance takes up far more space than the data it describes. Of content and history, the history is secondary, yet it consumes too many resources, which greatly reduces the usability and efficiency of these systems. To reduce the space occupied by provenance without compromising its completeness, Chapman et al. proposed the factorization and inheritance (FAI) algorithms. FAI only factors out and optimizes the information that provenance records have in common. This thesis uses a multi-dimensional compression algorithm that, in addition to optimizing the common information in provenance records, also optimizes the identity information of the data itself. It further mines the inherent similarity among provenance records and applies the web algorithm to the encoded ancestor information, which lowers the storage overhead of ancestor information further while leaving provenance lookup performance unaffected. This optimizes provenance storage at the micro level.
At the macro level, provenance data grows without bound over time, so storage space and query time grow without bound as well. To address this problem, the thesis takes the PASS system as a case study. It partitions the provenance information, builds indexes, and compresses the partitioned provenance files, exploiting the locality of provenance data to improve the provenance storage and lookup mechanisms of PASS. Experiments show that the multi-dimensional compression algorithm outperforms FAI both in storage space and in identity and ancestor queries. In the PASS storage optimization, partitioning the database, building indexes, and compressing the partitioned main database files yield lower space and query-time overhead than the original provenance storage method.
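The macro-level idea of partitioning, indexing, and compressing can be illustrated with a minimal Python sketch. It is not the thesis's implementation and does not reflect the actual PASS on-disk format: the class `SegmentedProvenanceStore`, its fixed-size segment policy, and the JSON record layout are all assumptions made for illustration. Records are appended to an open segment, full segments are sealed and compressed with `zlib`, and an index from object id to segment numbers lets a query decompress only the segments that mention the object.

```python
import json
import zlib
from collections import defaultdict


class SegmentedProvenanceStore:
    """Toy provenance store (hypothetical): compressed segments plus an index."""

    def __init__(self, segment_size=4):
        self.segment_size = segment_size  # records per segment (assumed policy)
        self.segments = []                # sealed segments, each zlib-compressed
        self.active = []                  # records of the open, uncompressed segment
        self.index = defaultdict(set)     # object id -> numbers of segments holding it

    def append(self, obj_id, ancestors):
        # The open segment will receive this number once it is sealed.
        seg_no = len(self.segments)
        self.active.append({"id": obj_id, "ancestors": ancestors})
        self.index[obj_id].add(seg_no)
        if len(self.active) >= self.segment_size:
            self._seal()

    def _seal(self):
        # Serialize and compress the open segment, then start a new one.
        raw = json.dumps(self.active).encode("utf-8")
        self.segments.append(zlib.compress(raw))
        self.active = []

    def _load(self, seg_no):
        # The open segment is read directly; sealed ones are decompressed on demand.
        if seg_no == len(self.segments):
            return self.active
        return json.loads(zlib.decompress(self.segments[seg_no]).decode("utf-8"))

    def ancestors_of(self, obj_id):
        # Only segments listed in the index for this object are touched.
        result = []
        for seg_no in sorted(self.index.get(obj_id, ())):
            for rec in self._load(seg_no):
                if rec["id"] == obj_id:
                    result.extend(rec["ancestors"])
        return result
```

In this sketch a lookup costs one decompression per segment that mentions the object, not one per segment in the store. That is the kind of benefit locality could provide if records for the same object tend to cluster in a few segments.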
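[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333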
