A Video Analysis System Based on Deep Learning

Published: 2018-09-07 12:18

[Abstract]: A huge volume of video is shared over the Internet every minute; on YouTube, the well-known video-sharing site, more than 100 hours of video are uploaded each minute. Classifying and indexing these videos is therefore necessary so that users can find the content they are interested in amid the explosion of multimedia information. Sound analysis and understanding of these videos also helps a site increase its traffic and supports its business analytics. Combining deep learning with video analysis techniques, this thesis proposes a video analysis system based on deep learning. The system uses a C3D network and a CNN to extract background features and action features, passes the extracted features through a multi-layer LSTM network, and, after a series of weighting operations, combines the background and action features into a description according to their weighted likelihoods, completing the video description and analysis task. To recognize the backgrounds and actions that appear in a video efficiently and accurately, the thesis proposes C3D, an improved architecture based on the CNN model. Compared with a conventional CNN, C3D modifies the convolution and pooling operations: to account for the temporal nature of video, it adds a temporal dimension on top of the existing spatial correlations, giving 3D convolution and 3D pooling. This allows more features to be extracted and retained, and improves the accuracy of background recognition and action recognition. To assemble the extracted features into an effective description, the thesis proposes a multi-layer LSTM model built on the LSTM model. Features extracted from the top layers of C3D tend to capture the global visual field, while features from the lower layers focus on fine, local detail. An effective and accurate description should not rely on top-level, macroscopic features alone; it should combine them with low-level detail features when describing video content. The multi-layer LSTM model therefore draws on both low-level and top-level features to describe video content more accurately. Finally, the thesis presents the implementation of the system's main functional modules and the experimental results. Analysis of these results indicates that the system meets practical requirements and has considerable engineering and practical value.
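
The 3D convolution and 3D pooling operations mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's C3D implementation: it assumes a single-channel clip, no padding, and stride 1, and the names `conv3d`, `max_pool3d`, and `shape` are hypothetical helpers introduced only for this example.

```python
from itertools import product


def conv3d(clip, kernel):
    """Valid (no padding), stride-1 3D cross-correlation on a single-channel clip.

    clip   : nested list indexed [t][y][x]  (frames, height, width)
    kernel : nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = [[[0.0] * (W - kw + 1) for _ in range(H - kh + 1)]
           for _ in range(T - kt + 1)]
    for t, y, x in product(range(T - kt + 1), range(H - kh + 1), range(W - kw + 1)):
        # The window spans kt consecutive frames, not just one image.
        out[t][y][x] = sum(
            clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
            for dt, dy, dx in product(range(kt), range(kh), range(kw))
        )
    return out


def max_pool3d(volume, size):
    """Non-overlapping 3D max pooling with window (st, sh, sw).

    Trailing rows, columns, or frames that do not fill a window are dropped.
    """
    st, sh, sw = size
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    return [
        [
            [
                max(
                    volume[t * st + dt][y * sh + dy][x * sw + dx]
                    for dt, dy, dx in product(range(st), range(sh), range(sw))
                )
                for x in range(W // sw)
            ]
            for y in range(H // sh)
        ]
        for t in range(T // st)
    ]


def shape(volume):
    """Return (frames, height, width) of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Toy clip: 4 frames of 4x4 pixels; every pixel brightens by 1 per frame.
clip = [[[float(t)] * 4 for _ in range(4)] for t in range(4)]

# Temporal-difference kernel: 2 frames deep, 1x1 in space (next frame minus current).
temporal_diff = [[[-1.0]], [[1.0]]]

motion = conv3d(clip, temporal_diff)      # responds to change across frames
pooled = max_pool3d(motion, (1, 2, 2))    # pool spatially, keep every time step
```

In this toy clip each frame is spatially uniform, so a kernel confined to a single frame sees no structure at all; the only signal is the change between frames, which the 3D kernel picks up because its window extends along the time axis. The output also keeps a time axis, which is what lets later layers continue to reason about motion.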
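[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.41; TP18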
