A Deep Video Natural Language Description Method Based on Multi-Feature Fusion
Published: 2018-07-15 10:29
[Abstract]: To address the low accuracy of automatic video annotation and description by computers, a deep video natural language description method based on multi-feature fusion is proposed. The method extracts spatial features, motion features, and video features from video frame sequences, fuses them, and uses the fused features to train a natural language description model based on long short-term memory (LSTM) networks. Multiple description models are trained with different feature combinations and combined by late fusion at test time: one model first generates several candidate outputs for the current input, the other models then compute the probability of each candidate, and the weighted sum of these probabilities determines the final output as the candidate with the highest score. The feature fusion methods comprise early fusion (feature concatenation and aligned weighted summation of different features), late fusion (weighted fusion of the output probabilities of models trained on different features), and fine-tuning an already trained LSTM model with early-fused features. Experiments on the standard MSVD test set show that fusing features of different types yields higher evaluation scores, fusing features of the same type does not score higher than a single feature, and fine-tuning a pre-trained model with fused features performs relatively poorly. The combination of early and late fusion achieves a METEOR score of 0.302 for the generated video descriptions, 1.34% higher than the best value found to date, indicating that the method can improve the accuracy of automatic video description.
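The abstract describes two fusion stages: early fusion of the extracted features and late fusion of the output probabilities of several trained models at test time. As a reading aid only, the short Python sketch below illustrates the general idea under assumed interfaces: the function names and the model objects with `generate`/`score` methods are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def early_fuse_concat(spatial_feat, motion_feat, video_feat):
    # Early fusion by concatenation: the per-frame feature vectors are
    # stacked into one longer vector.
    return np.concatenate([spatial_feat, motion_feat, video_feat], axis=-1)

def early_fuse_weighted_sum(feats, weights):
    # Early fusion by aligned weighted sum: the features are assumed to
    # have been projected to a common dimension so they can be added.
    return sum(w * f for w, f in zip(weights, feats))

def late_fuse_rescore(primary_model, other_models, video_feats, weights,
                      num_candidates=5):
    # Late fusion at test time: the primary model proposes candidate
    # sentences, every model scores each candidate, and the weighted sum
    # of those probabilities selects the final description.
    # `generate` and `score` are hypothetical methods, not a real API.
    candidates = primary_model.generate(video_feats,
                                        num_candidates=num_candidates)
    best_sentence, best_score = None, float("-inf")
    for sentence in candidates:
        probs = [m.score(sentence, video_feats)
                 for m in [primary_model] + other_models]
        fused = float(np.dot(weights, probs))
        if fused > best_score:
            best_sentence, best_score = sentence, fused
    return best_sentence
```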
[Author Affiliations]: School of Information and Software Engineering, University of Electronic Science and Technology of China; School of Computer Science and Engineering, University of Electronic Science and Technology of China
[Funding]: National Natural Science Foundation of China (61300192); Fundamental Research Funds for the Central Universities (ZYGX2014J052)
[CLC Number]: TP391.41
Article ID: 2123771
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2123771.html