Research on Affective Visual Question Answering

Published: 2021-03-23 02:58
Visual Question Answering (VQA) has recently attracted broad attention from researchers in machine learning. Many researchers have proposed attention models to address the need to focus on local regions of an image, but the essential affective information in images and videos is lost during feature extraction, and the generated answers carry little emotion, making them less natural and realistic. This dissertation therefore aims to generate more natural, emotion-aware answers by additionally analyzing the affective information in questions and images (or videos), filling the gap left by VQA approaches that ignore affect. Specifically, it focuses on VQA for images with a single mood, for images with multiple moods, and for videos. The results can be applied directly to education, visual assistance for the blind, healthcare, and other domains. The main contributions are as follows: (1) A Mood-Aware Image Question Answering (MAIQA) method based on attention models is proposed for single-mood images. It combines local image features with mood information detected from specific image regions and from the question to produce answers that convey affect. Here, "mood" refers only to the emotions of people appearing in the image, not of other objects. Concretely, the image, question and mood features are embedded into a single Long Short-Term Memory (LSTM) network, and separately ...
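The abstract only sketches the MAIQA architecture. A minimal NumPy illustration of the general idea of attention-conditioned feature fusion follows; all shapes, names, and the additive fusion choice are hypothetical stand-ins, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(regions, query):
    """Weight image region features by their relevance to a query vector."""
    scores = regions @ query        # (R,) one score per image region
    weights = softmax(scores)       # attention distribution over regions
    return weights @ regions        # (D,) attended image feature

def fuse(image_regions, question_vec, mood_vec):
    """Fuse attended image, question and mood features into one vector,
    which would feed a single answer-generating LSTM (hypothetical)."""
    query = question_vec + mood_vec         # condition attention on question + mood
    img = attend(image_regions, query)
    # concatenation is one common fusion choice; element-wise sum is another
    return np.concatenate([img, question_vec, mood_vec])

D, R = 8, 5                                 # feature dim, number of image regions
regions = rng.standard_normal((R, D))       # stand-in for CNN region features
q = rng.standard_normal(D)                  # stand-in for question embedding
m = rng.standard_normal(D)                  # stand-in for detected mood embedding
fused = fuse(regions, q, m)
print(fused.shape)                          # (24,)
```

The sketch only shows the fusion step; in the dissertation's setting the question and mood vectors would themselves come from learned embeddings and detectors, and the fused vector would be decoded into an answer by the LSTM.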

[Source]: Jiangsu University, Jiangsu Province

[Pages]: 128

[Degree level]: Doctoral

[Table of contents]:
Abstract
摘要
Chapter 1 Introduction
    1.1 Background and motivation
    1.2 Challenges
    1.3 Contributions
    1.4 Outline of the dissertation
Chapter 2 Review of related literature
    2.1 Visual question answering
        2.1.1 Image question answering
        2.1.2 Video question answering
    2.2 Mood detection
        2.2.1 Mood detection on images
        2.2.2 Mood detection on videos
    2.3 Visual captioning
        2.3.1 Image captioning
        2.3.2 Video captioning
    2.4 Multi-task learning
    2.5 Feature embeddings
    2.6 Visual mood attribute detection
    2.7 Attention models
    2.8 Traditional visual question answering
Chapter 3 Mood-aware image question answering
    3.1 Introduction
    3.2 The MAIQA model
        3.2.1 Image, question and mood embeddings
        3.2.2 Attention models for the image, question and mood
        3.2.3 Feature learning and inference
        3.2.4 Vocabulary
        3.2.5 Feature fusion
        3.2.6 Answer prediction
    3.3 Experiments and results
        3.3.1 The image dataset customization
        3.3.2 Experiment setup
        3.3.3 Qualitative analysis of sample answers
        3.3.4 Comparison of our mood detector with other baseline models
        3.3.5 Possible answer categories
        3.3.6 Comparison of the performance of our attention models
        3.3.7 Comparison of the MAIQA LSTM model with other models
    3.4 Brief summary
Chapter 4 Multi-mood image question answering
    4.1 Introduction
    4.2 The MMIQA model
        4.2.1 Image feature extraction, embedding and attention
        4.2.2 Question feature embedding and attention
        4.2.3 Mood feature detection, embedding and attention
        4.2.4 Triple attention model
        4.2.5 Answer vocabulary
        4.2.6 Fusion of features
        4.2.7 Answer generation
    4.3 Experiments and results
        4.3.1 The image dataset customization
        4.3.2 Experiment setup
        4.3.3 Qualitative analysis
        4.3.4 Comparison of feature embedding techniques using different dataset conditions
        4.3.5 Comparison of validation results of our feature embedding techniques
        4.3.6 Comparison of the accuracy of different multi-mood detectors
        4.3.7 Analysis of the contribution of the multi-mood detector to performance of MMIQA
        4.3.8 Overall comparison of MMIQA with the baseline model
    4.4 Brief summary
Chapter 5 Multi-mood video question answering
    5.1 Introduction
    5.2 The MMVQA model
        5.2.1 Overview
        5.2.2 Video QA route for the main question answering task
        5.2.3 Affective route for mood detection
        5.2.4 Prediction of the conventional and affective answers
    5.3 Experiments and results
        5.3.1 Video datasets
        5.3.2 Experiment setup
        5.3.3 Comparison with mood detection baseline model
        5.3.4 Attention model ablation studies
        5.3.5 Analysis of the accuracy of MMVQA conventional answers
        5.3.6 Analysis of the accuracy of MMVQA affective answers
        5.3.7 Qualitative analysis
    5.4 Brief summary
Chapter 6 General conclusions and future work
    6.1 General conclusions
    6.2 Our work
    6.3 Future work
Bibliography
Acknowledgements
Academic Publications


