Research on Video Object Extraction and Tracking Based on Low-Rank and Sparse Representation Models

Published: 2018-10-12 17:43

[Abstract]: Video object extraction and tracking are fundamental problems in computer vision and core technologies of intelligent video surveillance systems. Despite remarkable progress, they remain highly challenging because of the complexity of data, scenes, and environments. Addressing these factors from the perspective of low-rank and sparse representation models, this thesis studies four topics: video object segmentation based on a regularized low-rank representation model, multi-modal moving object detection based on weighted low-rank decomposition, object tracking based on image patch representation and dynamic graph learning, and multi-modal object tracking based on a collaborative sparse representation model.
For video object extraction, a segmentation framework based on a regularized low-rank representation model is proposed to handle the large intra-class variation and inter-class similarity of video data as well as video noise. Supervoxels serve as graph nodes, and a low-rank representation model optimizes the similarity relations among them, which effectively suppresses both sparse large noise and dense Gaussian noise. To make supervoxels more discriminative, a discriminative replication prior is introduced into the sparse representation model to regularize the sparse representation coefficient matrix, yielding the regularized sparse representation model. Because video data are usually very large, an optimization algorithm based on suboptimal low-rank decomposition is proposed to solve the model efficiently, with a theoretical guarantee of convergence. A stream processing method is also proposed so that the segmentation method can handle arbitrarily long videos with limited computing and storage resources. To verify its effectiveness, the optimized supervoxel similarity is applied to both unsupervised and interactive video object segmentation, achieving superior performance in both tasks.
To cope with the complexity of scenes and environments, the thesis proposes a general framework for multi-modal moving object detection based on weighted low-rank decomposition. Because visible-spectrum information is strongly affected by complex scenes, illumination, and haze, thermal infrared information is introduced to complement it. Specifically, a quality weight is introduced for each modality, and the low-rank background of each modality, a sparse foreground template shared across modalities, and continuity constraints on foreground and background pixels are modeled jointly, so that multi-modal data are fused adaptively and moving objects are detected robustly. To further improve detection efficiency while maintaining accuracy, an acceleration algorithm based on edge-preserving filtering is proposed, which brings the method close to real-time speed. In addition, a multi-modal moving object detection platform containing 25 video pairs is constructed, which addresses the lack of a standard evaluation system in this field and supports further research.
For object tracking, a patch-based dynamic graph learning method is proposed to address model drift in the tracking-by-detection framework by weakening background interference in the target representation. First, the tracking bounding box is divided into non-overlapping image patches, and each patch is assigned a weight expressing its importance to the target. Because the traditional 8-neighborhood graph ignores the global structure of the graph and local linear relations, the patches are taken as graph nodes and the graph structure is learned dynamically from their global low-rank structure, their sparse local linear relations, and the non-negativity of the edge weights, while the patch weight vector is jointly optimized in a semi-supervised manner. Second, to improve the timeliness of the tracker, a real-time optimization algorithm is proposed to solve the model. Finally, the optimized weight vector is embedded into target tracking and model updating, which greatly improves tracking performance.
To overcome the challenges posed by scene and environment complexity, the thesis further proposes a multi-modal object tracking method based on a collaborative sparse representation model within the Bayesian filtering framework. Traditional multi-modal trackers treat all modalities equally, so a modality whose information is highly ambiguous degrades the final tracking result. The proposed method therefore fuses the modalities adaptively by introducing a quality weight for each modality into the sparse representation model, which yields robust tracking. In particular, the weight of each modality is determined by its reconstruction error and by the discriminability between target and background, and it is optimized jointly with the sparse representation coefficients. In addition, because this problem lacks a standard evaluation platform, a standard multi-modal object detection platform is constructed, comprising 50 registered video pairs, 22 baseline methods, and 2 evaluation metrics. The platform provides a standard evaluation system for this problem and related fields and supports further research in the area.
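As a rough illustration of the low-rank plus sparse idea behind the moving object detection part, the sketch below splits a pixels-by-frames matrix D into a low-rank background and a sparse foreground by approximately solving min ||L||_* + lam * ||S||_1 subject to D = L + S with an inexact augmented Lagrangian scheme. This is the generic single-modality robust PCA formulation, not the weighted multi-modal model of the thesis; the function names, the default `lam`, and the step-size schedule are assumptions made only for this example.

```python
import numpy as np


def soft_threshold(x, tau):
    """Element-wise shrinkage: proximal operator of tau * L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def svd_threshold(x, tau):
    """Singular value shrinkage: proximal operator of tau * nuclear norm."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt


def low_rank_sparse_split(d, lam=None, n_iter=200, tol=1e-7):
    """Split d (pixels x frames) into low-rank and sparse parts.

    Hypothetical helper for illustration only; lam defaults to the
    common choice 1 / sqrt(max(m, n)).
    """
    m, n = d.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    norm_d = np.linalg.norm(d)
    mu = 1.25 / np.linalg.norm(d, 2)      # initial penalty (assumed schedule)
    y = np.zeros_like(d)                  # Lagrange multiplier
    sparse = np.zeros_like(d)
    for _ in range(n_iter):
        # Update the low-rank part with the sparse part fixed.
        low = svd_threshold(d - sparse + y / mu, 1.0 / mu)
        # Update the sparse part with the low-rank part fixed.
        sparse = soft_threshold(d - low + y / mu, lam / mu)
        residual = d - low - sparse
        y = y + mu * residual
        mu = min(mu * 1.5, 1e7)
        if np.linalg.norm(residual) <= tol * norm_d:
            break
    return low, sparse
```

Each column of `d` would hold one vectorized frame, so a static background gives a nearly rank-one `low` and moving objects show up as large entries of `sparse`. A quality-weighted multi-modal variant, as described in the abstract, would additionally need per-modality weights and continuity constraints, which this sketch does not attempt.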
[Degree-granting institution]: Anhui University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.41

[Similar literature]