
An Image Caption Generation Model Fusing Image Scene and Object Prior Knowledge

Published: 2019-04-10 19:13
[Abstract]: Objective: Current image captioning methods based on deep convolutional neural networks (CNN) and long short-term memory (LSTM) networks generally use only object-category information as prior knowledge when extracting CNN features, ignoring the scene prior knowledge in the image. The generated sentences therefore lack an accurate description of the scene and tend to misjudge the spatial relationships between objects. To address this problem, we design an image caption generation model that fuses scene and object-category prior information (F-SOCPK). The model incorporates both kinds of prior information and uses them jointly to generate the caption, improving sentence quality.

Method: First, the parameters of the CNN-S model are trained on the large-scale scene-category dataset Places205 so that CNN-S encodes more scene prior information; these parameters are then transferred to the CNNd-S model via transfer learning and used to capture the scene information of the image to be described. In parallel, the parameters of the CNN-O model are trained on the large-scale object-category dataset ImageNet and transferred to the CNNd-O model to capture the object information of the image. The extracted scene and object features are fed into the language models LM-S and LM-O, respectively; the outputs of LM-S and LM-O are passed through a Softmax function to obtain a probability score for every word in the vocabulary. Finally, a weighted fusion computes the final score of each word, the word with the highest probability is taken as the output at the current time step, and the caption of the image is generated step by step.

Results: Experiments were carried out on three public datasets: MSCOCO, Flickr30k, and Flickr8k. The proposed model outperforms the model that uses object-category information alone on several metrics, including BLEU (sentence fluency and precision), METEOR (word-level precision and recall), and CIDEr (semantic richness). On Flickr8k in particular, it improves CIDEr by 9% over the Object-based model, which uses only object categories, and by nearly 11% over the Scene-based model, which uses only scene categories.

Conclusion: The proposed method is effective: it improves substantially over the baseline models and compares favorably with other mainstream methods. Its advantage is most pronounced on larger datasets such as MSCOCO, while on smaller datasets such as Flickr8k there is still room for improvement. In future work, we will incorporate more visual prior information into the model, such as action categories and object-object relationships, to further improve caption quality, and will combine additional visual techniques, such as deeper CNN models, object detection, and scene understanding, to further improve sentence accuracy.
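The decoding step described in the Method paragraph amounts to a weighted fusion of the per-word probability distributions produced by the two language models, followed by a greedy word choice. The Python sketch below illustrates that step for a single time step; the function name fuse_and_pick_word, the variable names logits_scene and logits_object, and the fusion weight alpha are illustrative assumptions, not the paper's actual implementation.

# Illustrative sketch (not the authors' code): weighted fusion of the word
# distributions produced by the scene language model (LM-S) and the object
# language model (LM-O) at one decoding time step.
import numpy as np

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary;
    # subtracting the maximum keeps the exponentials numerically stable.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def fuse_and_pick_word(logits_scene, logits_object, alpha=0.5):
    # alpha is a hypothetical fusion weight in [0, 1]; the actual weighting
    # in F-SOCPK would be chosen on validation data, so this value is only
    # a placeholder.
    p_scene = softmax(logits_scene)      # P(word | scene stream, LM-S)
    p_object = softmax(logits_object)    # P(word | object stream, LM-O)
    p_fused = alpha * p_scene + (1.0 - alpha) * p_object
    return int(np.argmax(p_fused)), p_fused   # index of the highest-scoring word

# Toy usage with a five-word vocabulary; the full model would repeat this
# at every LSTM time step until an end-of-sentence token is produced.
rng = np.random.default_rng(0)
word_id, probs = fuse_and_pick_word(rng.normal(size=5), rng.normal(size=5), alpha=0.6)
print(word_id, probs)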
[Author Affiliations]: School of Mathematics and Physics, Jinggangshan University; Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Jinggangshan University; Department of Computer Science and Technology, Tongji University; School of Electronics and Information Engineering, Jinggangshan University
[Funding]: Fund of the Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation (WE2016015); Science and Technology Research Project of the Jiangxi Provincial Department of Education (GJJ160750, GJJ150788); Research Fund of Jinggangshan University (JZ14012)
[CLC Number]: TP391.41






Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2456050.html


