Semantic Analysis of News Video Based on Multimodal Features

Published: 2018-05-16 08:14

Topics: headline text detection + multimodal features; source: Xidian University, master's thesis, 2012

[Abstract]: With the rapid development of computer networks and multimedia technology, digital video has become an indispensable information carrier in daily life. How can people be helped to find the content that interests them in massive volumes of video data? How can the relevant government departments effectively supervise harmful video content that endangers social stability and unity or affects the healthy development of young people? Semantics-based multimedia information retrieval and security analysis of video semantic content are technical challenges that must be overcome to address these problems. Video data carries rich semantic content. As a high-level semantic cue, the text in a video (both topic captions and speech transcripts) is highly valuable for understanding video content. The key problems in extracting video semantic information include how to detect and extract topic captions from video, how to obtain the speech transcript, and how to effectively fuse the caption information and the audio transcript of the same video story. This thesis proposes a framework for extracting semantic information from news video based on multimodal feature fusion. First, topic captions are detected, localized, and recognized. Second, the audio in the video is classified and passed through speech recognition. Finally, to address the high error rate of the speech recognition output, the topic caption information is submitted to a search engine to retrieve web pages related to the video story, and the text of those pages is used to correct the speech recognition results. This cross-modal fusion of caption and audio information at the natural-language level improves the accuracy of video semantic extraction. Tests on a medium-sized experimental data set (comprising video data and a web page collection) show that the proposed method is effective: after error correction, speech recognition accuracy reaches about 65%.
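The final step described in the abstract, correcting speech recognition output with text from related web pages, can be illustrated with a minimal sketch. This is not the thesis's algorithm, which the abstract does not specify; it substitutes plain fuzzy string matching from Python's standard `difflib` module. The function names (`build_vocabulary`, `correct_transcript`, `word_accuracy`), the `cutoff` value, and the English sample data are all hypothetical and chosen only for illustration.

```python
import difflib
import re
from collections import Counter


def build_vocabulary(pages):
    """Count word occurrences across the retrieved web-page texts."""
    counts = Counter()
    for text in pages:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def correct_transcript(asr_words, vocab, cutoff=0.75):
    """Replace out-of-vocabulary ASR words with the closest web-page word.

    Assumption: a word that never appears in the related web pages is a
    likely recognition error. `cutoff` is an arbitrary similarity threshold.
    """
    candidates = list(vocab)
    corrected = []
    for word in asr_words:
        if word in vocab:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, candidates, n=3, cutoff=cutoff)
        if matches:
            # Among sufficiently similar candidates, prefer the most frequent.
            corrected.append(max(matches, key=lambda m: vocab[m]))
        else:
            # No plausible replacement: keep the ASR output unchanged.
            corrected.append(word)
    return corrected


def word_accuracy(hypothesis, reference):
    """Fraction of reference words matched, in order, by the hypothesis."""
    matcher = difflib.SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference) if reference else 0.0


# Hypothetical web pages retrieved by searching for the caption text.
pages = [
    "The earthquake struck the province early on Monday.",
    "Rescue teams reached the earthquake zone.",
]
reference = ["the", "earthquake", "struck", "the", "province"]
asr_output = ["the", "earthquack", "struck", "the", "provence"]

vocab = build_vocabulary(pages)
fixed = correct_transcript(asr_output, vocab)
print(fixed)
print(word_accuracy(asr_output, reference), word_accuracy(fixed, reference))
```

A real system for Chinese news would need word segmentation and most likely phonetic (for example pinyin-level) similarity rather than character overlap; the sketch only shows the general idea of using caption-retrieved text as a correction vocabulary.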
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.41

[References]