Research on Methods for Reducing Provenance Data

Published: 2018-02-14 08:34

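Keywords: data provenance; data reduction; centrality analysis; graph clustering　Source: Shandong University, 2017 master's thesis　Document type: degree thesis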

[Abstract]: Data provenance is the tracing, reproduction, and presentation of the original data from which a target data item was derived, together with the process by which that data evolved. Because it offers distinctive advantages in detecting data loss, reconstructing data, and verifying the security and trustworthiness of data, it has broad application prospects in big data engineering and information security. However, ever since provenance systems first appeared, the size of provenance data has been the bottleneck limiting their use. To keep target data traceable, the provenance data is often far larger than the target data itself, and the problem is even more pronounced for provenance systems aimed at big data engineering. Very large provenance data not only severely reduces the efficiency of provenance queries and drives up storage, computation, and management costs; because the associations among the data are so complex and fine-grained, it also makes provenance results harder to understand, greatly lowers the quality of data provenance, and directly hinders wider adoption of the technology. The approaches currently used in China and abroad to reduce provenance data, chiefly redundancy-removing compression and noise filtering, cannot fundamentally solve the size problem. Building on the characteristics of provenance data and the structure of the provenance graph, this thesis coarsens large-scale provenance data by separating out cold data and fine-grained association data, and proposes effective methods for reducing its size.

The main work of the thesis comprises:
1. A study of a type-based layered reduction method for provenance data. The transitivity of dependencies between data items is used to reconstruct the dependencies between data objects; the provenance data is partitioned into layers by type; and the "cold data" layers, which have finer granularity and are used less often, are stripped away, simplifying the provenance data and improving provenance efficiency.
2. A study of a reduction method based on centrality difference. Task-layer data is divided along boundaries according to the differences in centrality between data nodes, and the boundary data nodes with high influence within a task are extracted as the key provenance, thereby reducing the size of the provenance data.
3. A study of a reduction method based on correlation clustering. Data is clustered at coarse granularity according to correlation, and the non-boundary data that describes task details is stored in tiers or pruned, so that the provenance data is reduced from the perspective of coarse-grained clustering.

The innovations of the thesis are:
1. A type-based layered reduction method for provenance data, which partitions the data into layers by object type and then strips away the less frequently used "cold data" layers to reduce the size of the provenance data.
2. A reduction method based on centrality difference, which uses centrality differences to identify coarse-grained task boundaries and extracts the boundary data nodes with high influence within a task as the key provenance.
3. A reduction method based on correlation clustering, which clusters provenance data according to the correlation between data items and reduces the data by stripping away the intra-cluster association data after clustering.

Each of the proposed methods was evaluated experimentally on the standard provenance trace dataset of Harvard University's PASSv2, and the results confirm that the proposed methods are feasible and effective.
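The abstract does not give the algorithm itself, so the following is only a minimal sketch of the idea behind the first method (type-based layering), under stated assumptions: the provenance graph is acyclic and stored as a mapping from each node to the nodes it directly depends on, every node carries a type label, and a whole type is treated as "cold". The function name `strip_cold_layer`, the type names `file`, `process`, and `pipe`, and the sample nodes are all hypothetical. When a cold node is removed, transitivity of the dependency relation is used to reconnect the kept nodes on either side of it.

```python
def strip_cold_layer(deps, node_type, cold_types):
    """Drop every node whose type is in cold_types and rewire the rest.

    deps:       dict mapping a node to the set of nodes it directly depends on
    node_type:  dict mapping every node to its type label (hypothetical labels)
    cold_types: set of type labels treated as the "cold" layer
    Returns a new dependency dict that contains only the kept nodes.
    """
    def is_cold(node):
        return node_type[node] in cold_types

    def nearest_kept_ancestors(start):
        # Walk upward through cold nodes only; stop at the first kept node
        # on each path. This relies on dependencies being transitive.
        found, seen = set(), set()
        stack = list(deps.get(start, ()))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if is_cold(node):
                stack.extend(deps.get(node, ()))
            else:
                found.add(node)
        return found

    return {n: nearest_kept_ancestors(n) for n in node_type if not is_cold(n)}


# Hypothetical trace: out.csv <- sort <- pipe1 <- filter <- in.csv
node_type = {
    "out.csv": "file",
    "sort": "process",
    "pipe1": "pipe",
    "filter": "process",
    "in.csv": "file",
}
deps = {
    "out.csv": {"sort"},
    "sort": {"pipe1"},
    "pipe1": {"filter"},
    "filter": {"in.csv"},
}

without_pipes = strip_cold_layer(deps, node_type, {"pipe"})
files_only = strip_cold_layer(deps, node_type, {"pipe", "process"})
```

In this sketch, stripping the `pipe` layer leaves `sort` depending directly on `filter`, and stripping both `pipe` and `process` leaves `out.csv` depending directly on `in.csv`. How the thesis decides which layers count as cold (granularity and usage frequency) is not specified in the abstract and is not modeled here.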
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13; TP309

[Similar literature]