
An Image Caption Generation Model Fusing Image Scene and Object Prior Knowledge

Published: 2019-04-10 19:13
[Abstract]: Objective: Current image captioning methods based on deep convolutional neural networks (CNN) and long short-term memory (LSTM) networks generally use only object-category information as prior knowledge when extracting CNN features, ignoring the scene prior knowledge in the image. The generated sentences therefore lack an accurate description of the scene and tend to misjudge the spatial relationships between objects. To address this problem, we design an image caption generation model that fuses scene and object-category prior information (F-SOCPK). The model incorporates both kinds of prior information and uses them jointly to generate the caption, improving sentence quality.

Method: First, the parameters of the CNN-S model are trained on the large-scale scene-category dataset Places205 so that CNN-S encodes more scene prior information; these parameters are then transferred to the CNNd-S model via transfer learning and used to capture the scene information of the image to be described. In parallel, the parameters of the CNN-O model are trained on the large-scale object-category dataset ImageNet and transferred to the CNNd-O model to capture the object information of the image. The extracted scene and object features are fed into the language models LM-S and LM-O, respectively; the outputs of LM-S and LM-O are passed through a Softmax function to obtain a probability score for every word in the vocabulary. Finally, a weighted fusion computes the final score of each word, the word with the highest probability is taken as the output at the current time step, and the caption of the image is generated step by step.

Results: Experiments were carried out on three public datasets: MSCOCO, Flickr30k, and Flickr8k. The proposed model outperforms the model that uses object-category information alone on several metrics, including BLEU (sentence fluency and precision), METEOR (word-level precision and recall), and CIDEr (semantic richness). On Flickr8k in particular, it improves CIDEr by 9% over the Object-based model, which uses only object categories, and by nearly 11% over the Scene-based model, which uses only scene categories.

Conclusion: The proposed method is effective: it improves substantially over the baseline models and compares favorably with other mainstream methods. Its advantage is most pronounced on larger datasets such as MSCOCO, while on smaller datasets such as Flickr8k there is still room for improvement. In future work, we will incorporate more visual prior information into the model, such as action categories and object-object relationships, to further improve caption quality, and will combine additional visual techniques, such as deeper CNN models, object detection, and scene understanding, to further improve sentence accuracy.
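The decoding step described in the Method paragraph amounts to a weighted fusion of the per-word probability distributions produced by the two language models, followed by a greedy word choice. The Python sketch below illustrates that step for a single time step; the function name fuse_and_pick_word, the variable names logits_scene and logits_object, and the fusion weight alpha are illustrative assumptions, not the paper's actual implementation.

# Illustrative sketch (not the authors' code): weighted fusion of the word
# distributions produced by the scene language model (LM-S) and the object
# language model (LM-O) at one decoding time step.
import numpy as np

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary;
    # subtracting the maximum keeps the exponentials numerically stable.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def fuse_and_pick_word(logits_scene, logits_object, alpha=0.5):
    # alpha is a hypothetical fusion weight in [0, 1]; the actual weighting
    # in F-SOCPK would be chosen on validation data, so this value is only
    # a placeholder.
    p_scene = softmax(logits_scene)      # P(word | scene stream, LM-S)
    p_object = softmax(logits_object)    # P(word | object stream, LM-O)
    p_fused = alpha * p_scene + (1.0 - alpha) * p_object
    return int(np.argmax(p_fused)), p_fused   # index of the highest-scoring word

# Toy usage with a five-word vocabulary; the full model would repeat this
# at every LSTM time step until an end-of-sentence token is produced.
rng = np.random.default_rng(0)
word_id, probs = fuse_and_pick_word(rng.normal(size=5), rng.normal(size=5), alpha=0.6)
print(word_id, probs)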
[Author Affiliations]: School of Mathematics and Physics, Jinggangshan University; Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Jinggangshan University; Department of Computer Science and Technology, Tongji University; School of Electronics and Information Engineering, Jinggangshan University
[Funding]: Fund of the Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation (WE2016015); Science and Technology Research Project of the Jiangxi Provincial Department of Education (GJJ160750, GJJ150788); Research Fund of Jinggangshan University (JZ14012)
[CLC Number]: TP391.41






Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2456050.html


