基于镜头及场景上下文的短视频标注方法研究 (Research on Short-Video Annotation Methods Based on Shot and Scene Context)
[Abstract]: With the rapid development of digital media, communication, and network technologies, the volume of digital media information, of which video is the prime example, is growing rapidly. Short videos are a class of video data with rich and varied content, and finding useful information in large collections of them is a long-standing user concern that has given rise to applications such as video indexing and video retrieval. Video annotation is the core of these applications and has become a hot research topic in digital media applications and computer vision. From a semantic point of view, a video can be divided into several semantic units, each with its own semantic connotation, so annotation can be performed at each semantic level. Building on an in-depth analysis of video structure, this thesis segments video into different semantic units and annotates short videos at both the shot semantic layer and the scene semantic layer. The main contributions and innovations are as follows:

(1) Combining global and local features of video frames, a novel shot boundary detection method is proposed that fuses video dynamic texture with SIFT features. Two adjacent frames are partitioned into uniform blocks, the average gradient of each block is computed in RGB color space, and the average gradients of all blocks together form the video dynamic texture. A shot change is detected by comparing the dynamic textures of adjacent frames and by matching their SIFT features (a hedged code sketch follows the abstract). The method detects shot boundaries in different types of video data with high accuracy.

(2) A video semantic annotation model based on shot events is proposed. Building on the analysis of video structure, the moving object of a shot and the background color features of its key frame are extracted to express the shot's event, which extends to the expression of scene events; ultimately, the collection of all events constitutes the topic of a video clip. The model takes the event group, composed of the shot's moving object and its environment background, as the annotation result; it captures the semantic connotation of a shot and improves the accuracy of video semantic expression.

(3) A new video annotation method based on semi-supervised clustering is proposed. With the shot event as the unit, videos are annotated with event groups. To reduce the dependence of annotation on labeled samples, a semi-supervised K-means clustering algorithm is constructed following the idea of semi-supervised learning, and its objective function is optimized so that the final clusters exhibit not only low inter-class coupling and high intra-class cohesion but also reflect the local data-density distribution within each class (a seeded-K-means sketch follows the abstract). The algorithm clusters multi-attribute heterogeneous data such as video and improves annotation accuracy.

(4) A new context-based multiple-kernel-learning method for video classification is proposed. Extending the traditional bag-of-words model, a video scene classification model is built from the correlation between the key frames of adjacent shots. First, the video is segmented into shots, key frames are extracted, and the key-frame images are normalized.
The key-frame images are then used as image blocks to synthesize a new image that preserves their temporal relation; SIFT features and HSV color features are extracted from this new image and mapped into a Hilbert space. Through multiple-kernel learning, suitable groups of kernel functions are selected to train on each image, yielding a classification model with improved classification performance (a kernel-combination sketch follows the abstract).

These results can be applied widely to video classification, video indexing, video retrieval, video content understanding, video data management, and related fields, and they have both important theoretical significance and high application value.
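To make contribution (1) concrete, below is a minimal Python sketch of the two cues, assuming OpenCV and NumPy. The block count, the thresholds, and the rule for fusing the two cues are illustrative assumptions, not the thesis's actual parameters.

```python
# Sketch of shot-boundary detection: block-gradient "dynamic texture" + SIFT matching.
import cv2
import numpy as np

def dynamic_texture(frame_bgr, blocks=8):
    """Average gradient magnitude of each image block, per color channel."""
    h, w, _ = frame_bgr.shape
    bh, bw = h // blocks, w // blocks
    texture = np.zeros((blocks, blocks, 3), dtype=np.float32)
    for c in range(3):
        gx = cv2.Sobel(frame_bgr[:, :, c], cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(frame_bgr[:, :, c], cv2.CV_32F, 0, 1)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        for i in range(blocks):
            for j in range(blocks):
                texture[i, j, c] = mag[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].mean()
    return texture

def sift_match_ratio(frame_a, frame_b):
    """Fraction of frame_a's SIFT keypoints matched in frame_b (Lowe ratio test)."""
    sift = cv2.SIFT_create()
    _, da = sift.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    _, db = sift.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    if da is None or db is None:
        return 0.0
    pairs = cv2.BFMatcher().knnMatch(da, db, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(da), 1)

def is_shot_boundary(prev, cur, tex_thresh=25.0, match_thresh=0.1):
    """Declare a cut only when the global and the local cue agree (assumed rule)."""
    tex_dist = np.abs(dynamic_texture(prev) - dynamic_texture(cur)).mean()
    return tex_dist > tex_thresh and sift_match_ratio(prev, cur) < match_thresh
```

Requiring both cues to agree follows the abstract's idea that the global texture cue and the local SIFT cue complement each other; in practice the two thresholds would need to be tuned per video type.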
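For contribution (3), the thesis optimizes a density-aware semi-supervised objective; the sketch below, assuming only NumPy, shows just the basic seeded variant of semi-supervised K-means, in which labeled samples initialize the centroids and remain pinned to their own class's cluster. All names and parameters are illustrative.

```python
# Seeded semi-supervised K-means: labeled samples seed and constrain the clusters.
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, iters=100):
    """X: (n, d) features; seed_idx / seed_labels: indices of labeled samples
    and their classes in {0..k-1}. Assumes at least one seed per class.
    Returns a cluster assignment for every sample."""
    # Initialize each centroid as the mean of its class's labeled seeds.
    centers = np.stack([X[seed_idx[seed_labels == c]].mean(axis=0) for c in range(k)])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign every point to its nearest centroid ...
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # ... but labeled seeds always stay in their own class's cluster.
        labels[seed_idx] = seed_labels
        new_centers = np.stack([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```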
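For contribution (4), the sketch below illustrates the kernel-combination idea with scikit-learn: a kernel on SIFT bag-of-words histograms is combined with a kernel on HSV color histograms and fed to a precomputed-kernel SVM. Genuine multiple-kernel learning would optimize the kernel weights jointly with the classifier; here the weight alpha is fixed by hand, and the kernel choices (chi-square and RBF) are assumptions rather than the thesis's actual configuration.

```python
# Two-kernel combination over SIFT BoW and HSV histogram features (simplified MKL).
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel, rbf_kernel

def combined_kernel(sift_a, hsv_a, sift_b=None, hsv_b=None, alpha=0.5):
    """K(A, B) = alpha * chi2(SIFT BoW) + (1 - alpha) * RBF(HSV histogram)."""
    return alpha * chi2_kernel(sift_a, sift_b) + (1 - alpha) * rbf_kernel(hsv_a, hsv_b)

# Hypothetical data: one synthesized key-frame image per video, represented by
# a 200-bin SIFT bag-of-words histogram and a 64-bin HSV color histogram.
rng = np.random.default_rng(0)
sift_bow = rng.random((40, 200))
hsv_hist = rng.random((40, 64))
y = rng.integers(0, 2, size=40)

clf = SVC(kernel="precomputed")
clf.fit(combined_kernel(sift_bow, hsv_hist), y)
# Prediction needs the kernel between test and training samples, K(test, train);
# here the training set doubles as a test set for a sanity check.
preds = clf.predict(combined_kernel(sift_bow, hsv_hist, sift_bow, hsv_hist))
```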
【Degree-granting institution】: 上海大学 (Shanghai University)
【Degree level】: Doctoral (博士)
【Year conferred】: 2016
【CLC classification number】: TP391.41
【Similar Literature】
Related journal articles (top 10):
1 陆懿; 陈光梦; 毕宏杰; 董栋. An improved synthesis algorithm for natural dynamic textures [J]. 计算机工程与设计, 2008(14).
2 姚伟光; 王赢; 许存禄. A new method applying local binary patterns to dynamic texture recognition [J]. 微计算机信息, 2010(09).
3 陈昌红; 赵恒; 胡海虹; 梁继民. Human motion analysis based on an improved dynamic texture model [J]. 模式识别与人工智能, 2010(02).
4 陈青; 朱俊宇; 唐朝晖; 刘金平; 桂卫华. Dynamic texture modeling for condition recognition and analysis in sulfur flotation [J]. 计算机与应用化学, 2013(10).
5 邵婧; 王冠香; 郭蔚. Fire detection based on video dynamic textures [J]. 中国图象图形学报, 2013(06).
6 陈红倩; 陈谊; 曹健; 刘鹂. Real-time forest rendering based on dynamic texture techniques [J]. 计算机仿真, 2012(06).
7 何莎; 费树岷. Modeling of dynamic texture backgrounds [J]. 计算机应用, 2009(S2).
8 邹运兰; 王仁芳. Real-time water-surface simulation based on multi-texture and dynamic texture techniques [J]. 浙江万里学院学报, 2010(06).
9 陈红倩; 李凤霞; 黄天羽; 战守义. A dynamic-texture-based visualization method for motion scenes [J]. 北京理工大学学报, 2009(06).
10 于鑫; 韩勇; 陈戈. Flame effect simulation based on dynamic textures and particle systems [J]. 信息与电脑(理论版), 2009(11).
Related conference papers (top 1):
1 陆懿; 陈光梦. An improved synthesis algorithm for color dynamic textures [A]. 中国仪器仪表学会第九届青年学术会议论文集 [C]. 2007.
Related doctoral dissertations (top 3):
1 王勇. Dynamic texture recognition based on chaotic feature vectors [D]. 上海交通大学, 2014.
2 彭太乐. Research on short-video annotation methods based on shot and scene context [D]. 上海大学, 2016.
3 周丙寅. Tensor decomposition and its applications to dynamic textures [D]. 河北师范大学, 2012.
Related master's theses (top 10):
1 陆懿. An improved dynamic texture recognition algorithm based on nonlinear models [D]. 复旦大学, 2008.
2 徐磊磊. Research on dynamic texture properties and their simulation algorithms [D]. 华中科技大学, 2007.
3 姚伟光. A new dynamic texture description method based on local binary motion patterns [D]. 兰州大学, 2009.
4 周文玲. Research on recognition and reconstruction of dynamic textures in augmented reality [D]. 华东师范大学, 2011.
5 刘霞. Research and implementation of dynamic textures for natural scenery simulation [D]. 国防科学技术大学, 2005.
6 丁悦. Dynamic texture analysis with data-driven Markov chain Monte Carlo models [D]. 南京理工大学, 2007.
7 曹寿刚. Research on video classification techniques based on Lie group theory and dynamic textures [D]. 华中科技大学, 2013.
8 高平. Research on dynamic texture recognition based on extended statistical landscape features [D]. 兰州大学, 2009.
9 施濵. Research on dynamic textures based on spatiotemporal oriented energy [D]. 上海交通大学, 2012.
10 张茜. Research on synthesis techniques for flowing-water effects based on dynamic textures [D]. 山东大学, 2006.