Research on Image Semantic Content Acquisition Based on Visual Cognitive Mechanisms

Published: 2018-06-16 00:29

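Topic of this thesis: superpixel segmentation + saliency detection; Source: University of Science and Technology Beijing, doctoral dissertation, 2016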

[Abstract]: To simulate the human visual cognitive mechanism on a computer, reproduce the visual functions of humans and other higher organisms, and thereby perceive, recognize, and understand image scenes that reflect the objective world, semantic content that humans can understand must be derived from the visual content of an image. In the initial stage of visual perception, visual attention tends to settle quickly on local regions or objects that carry some semantic information, and these regions or objects are exactly what the semantic content describes. As the local regions are located, the visual system uses the visual differences in shape and local features among them to focus automatically on the main, or salient, objects in the scene. Finally, cognition unfolds around the attended salient objects and their associated information, forming the semantic content and perception that describe the whole scene. Following this process, the thesis first uses an improved superpixel segmentation method to extract local regions that carry some semantic information. It then combines the visual features of these regions to build a detection model for salient objects or regions, yielding mid- and high-level semantic information in the image, namely saliency and salient visual content. Finally, guided by the salient objects or regions and their related information, it builds an automatic semantic annotation model with a deep neural network to obtain the final high-level semantic description of the scene. The specific work is as follows.
1) For local region extraction, a superpixel segmentation method that fuses texture information into SLICO is proposed. During segmentation the method incorporates texture features that reflect the inherent outer contours and boundaries of objects and regions in the image. It also searches a circular region around each seed pixel, which further improves processing efficiency while letting the resulting superpixels follow the outer contours of local regions or objects more closely. The method therefore produces, relatively quickly, superpixels of regular size and shape whose boundaries match the outer contours of objects and regions. Experiments and quantitative comparison on the public BSDS500 dataset show that the proposed SLICO-t method outperforms the highly regarded SLICO method. In boundary recall in particular, it consistently exceeds SLICO by 8 to 9 percentage points.
2) For salient object or region detection, a sparse histogram model is first proposed to describe the local region information of a superpixel. The histogram model jointly describes the local texture, color, and shape of the region. An image saliency detection method is then built on this model. It separates the detected salient objects or regions clearly and completely from the background, and the detected objects or regions retain relatively complete outer contours and shape features as well as local texture detail. Experiments and quantitative evaluation on the public test dataset provided by Achanta et al., with comparison against five currently popular saliency detection methods, show that the proposed method is better than the others in precision, mean F-measure, and mean absolute error.
3) For automatic image annotation and semantic content acquisition, the thesis first uses the visual features of the salient objects in the scene as prior knowledge to perceive the salient objects or regions. On the basis of the perceived salient objects or regions, it then applies the features of the overall local regions for a further, reinforcing mapping. This two-layer mapping is trained on two kinds of visual features and amounts to neural-network-based decision-level fusion carried out during self-learning. In encoding image and text semantic information, the thesis adopts an order-preserving mapping that has already been validated elsewhere, so that the latent relationship between images and their textual semantic descriptions is uncovered more accurately. The model is trained, validated, and tested on three public datasets, Flickr8k, Flickr30k, and MSCOCO, and is evaluated on bidirectional image-text semantic retrieval. The results show that, compared with previously published methods, the proposed method achieves further improvement in recall at every cutoff (Recall@K, K = 1, 5, 10), and that the semantic content it produces is more consistent with human cognitive habits and reads naturally and fluently.
The results of this thesis are also a useful reference for research on local image feature representation and extraction, image segmentation, and image understanding more broadly.
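The evaluation measures named in the abstract have standard definitions, and a small sketch may help make them concrete. The code below is an illustration only, not the thesis's evaluation code: the function names are hypothetical, the F-measure weight `beta_sq = 0.3` is the value conventionally used in saliency benchmarks rather than one stated in the abstract, and the retrieval ranking assumes exactly one correct match per query (the Flickr and MSCOCO datasets actually pair each image with several captions).

```python
from typing import List, Sequence


def ranks_from_scores(scores: Sequence[Sequence[float]]) -> List[int]:
    """1-based rank of the correct candidate for each query.

    Assumption: scores[i][j] is the similarity of query i to candidate j,
    higher is better, and candidate i is the single correct match for query i.
    """
    ranks = []
    for i, row in enumerate(scores):
        # Rank = 1 + number of candidates scoring strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct item appears in the top k results."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def f_measure(precision: float, recall: float, beta_sq: float = 0.3) -> float:
    """Weighted F-measure; beta_sq = 0.3 is an assumed, conventional weight."""
    denom = beta_sq * precision + recall
    return 0.0 if denom == 0 else (1 + beta_sq) * precision * recall / denom


def mean_absolute_error(saliency: Sequence[Sequence[float]],
                        mask: Sequence[Sequence[float]]) -> float:
    """Mean absolute difference between a saliency map in [0, 1] and a binary mask."""
    total, count = 0.0, 0
    for sal_row, mask_row in zip(saliency, mask):
        for s, m in zip(sal_row, mask_row):
            total += abs(s - m)
            count += 1
    if count == 0:
        raise ValueError("inputs must be non-empty")
    return total / count


# Toy image-to-sentence similarity matrix (made-up numbers for illustration).
scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.3],
    [0.1, 0.6, 0.5],
]
image_to_text = ranks_from_scores(scores)
# Transposing the matrix gives the sentence-to-image direction.
text_to_image = ranks_from_scores(list(zip(*scores)))
for k in (1, 5, 10):
    print(k, recall_at_k(image_to_text, k), recall_at_k(text_to_image, k))
```

Bidirectional retrieval is scored by running the same ranking in both directions, which is why the sketch transposes the similarity matrix instead of using a second function.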
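[Degree-granting institution]: University of Science and Technology Beijing
[Degree level]: Doctorate
[Year degree granted]: 2016
[Classification number]: TP391.41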

[Related literature]