Research on News Video Content Analysis Based on Multimodal Information

Published: 2018-12-14 11:33

[Abstract]: As video data grows rapidly, processing, browsing, retrieving, and managing it effectively has become a pressing practical problem. Video content analysis aims to impose structure on unstructured video data, extract its semantic content, bridge low-level features and high-level semantics, and ultimately support applications such as video summarization, indexing, and retrieval that give users convenient access to video content. This thesis takes news video as its subject, uses multimodal information (audio, captions, and visual content) and its effective fusion as its means, and uses models from pattern recognition theory as its tools to study video content analysis in depth. The main contributions are threefold:
(1) A novel fast anchorperson shot detection algorithm operating in the MPEG compressed domain. In the preprocessing stage, an improved method for detecting faces from compressed-domain information is introduced. In the shot clustering stage, a novel metric feature is constructed to cluster anchorperson shots with hierarchical clustering, and fuzzy C-means clustering is used to solve the problem of determining an adaptive threshold during clustering. The algorithm speeds up anchorperson shot detection while maintaining high detection performance.
(2) A decision-tree-based shot classification algorithm that assigns news video shots, in sequence, to six classes: commercial, "other", still image, anchorperson, reporter, and monologue. Commercial, "other", and still-image shots are detected using features such as black frames, motion, time, and faces; anchorperson shots are detected by clustering. For reporter and monologue shots, which are harder to tell apart, detection is recast as a text sequence labeling problem and modeled with conditional random fields. The algorithm effectively fuses multimodal information, including audio, faces, and context, distinguishes the important shots in news video, and achieves good classification results.
(3) A news story unit segmentation algorithm that fuses audio, caption, and visual information. High-level content features such as caption changes, audio types, and shot types are processed jointly: the news shot sequence is converted into several keyword sequences, which turns story unit segmentation into a text sequence segmentation problem. The algorithm is modeled with conditional random fields, makes full use of the contextual information within each sequence and across sequences, and achieves good segmentation performance.
In addition, the thesis surveys video content analysis techniques, constructs a hierarchical audio classification method based on rules and hidden Markov models, implements a fairly complete caption extraction framework for news video, and finally designs and implements a video content analysis and summarization system based on the COM architecture. In summary, the thesis performs content-based analysis of news video from the perspectives of audio, captions, visual content, and their effective fusion, and the experimental results demonstrate the effectiveness of these algorithms.
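Contribution (1) mentions using fuzzy C-means to determine the clustering threshold adaptively, but the abstract gives no details. The following is only a minimal sketch of one way this could work, assuming the inputs are one-dimensional inter-shot distances and the threshold is placed between two fuzzy clusters. The function names, the fuzzifier m = 2, and the midpoint rule are illustrative assumptions, not the method of the thesis.

```python
def fuzzy_c_means_1d(values, m=2.0, iters=100, tol=1e-9):
    """Two-cluster fuzzy C-means on scalars; returns the sorted cluster centers.

    Assumption: m is the usual fuzzifier (m > 1); centers start at min and max.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo, hi  # all values identical: nothing to separate
    centers = [lo, hi]
    exp = 2.0 / (m - 1.0)
    for _ in range(iters):
        # Membership of each value in each cluster (standard FCM update).
        memberships = []
        for x in values:
            d = [abs(x - c) for c in centers]
            if 0.0 in d:
                # A value sitting exactly on a center belongs fully to it.
                u = [1.0 if dk == 0.0 else 0.0 for dk in d]
            else:
                u = [1.0 / sum((dk / dj) ** exp for dj in d) for dk in d]
            memberships.append(u)
        # Centers become membership-weighted means of the values.
        new_centers = []
        for k in range(2):
            w = [u[k] ** m for u in memberships]
            new_centers.append(sum(wi * x for wi, x in zip(w, values)) / sum(w))
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return tuple(sorted(centers))


def adaptive_threshold(distances):
    """Hypothetical rule: cut halfway between the two fuzzy cluster centers."""
    low, high = fuzzy_c_means_1d(distances)
    return (low + high) / 2.0


if __name__ == "__main__":
    # Made-up distances: three similar shot pairs and three dissimilar ones.
    sample = [0.10, 0.12, 0.15, 0.80, 0.85, 0.90]
    print(adaptive_threshold(sample))
```

With a threshold derived from the data itself, the cut-off between "same anchorperson" and "different content" would not have to be hand-tuned for each broadcast, which is the motivation the abstract gives for the adaptive step.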
[Degree-granting institution]: Tianjin University
[Degree level]: Doctoral
[Year degree conferred]: 2007
[Classification number]: TP391.41
