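Research on Multimodal Emotion Recognition Based on Speech and Images

Published: 2018-07-05 01:41

Topic of this thesis: emotion recognition + speech features; Source: Harbin Institute of Technology, 2017 master's thesis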

[Abstract]: With the rise of artificial intelligence, more natural and intelligent human-computer interaction has attracted sustained attention, making affective computing an active research area. As an important branch of affective computing, emotion recognition has developed rapidly in recent years and has broad prospects. The main approaches are speech-based emotion recognition, image-based emotion recognition, and emotion recognition based on multimodal fusion. The emotional information carried by a single modality, speech or image alone, is incomplete and cannot fully meet expectations, whereas multimodal fusion combines the information from each modality so that the modalities complement one another and yield better recognition. This thesis therefore studies multimodal emotion recognition based on speech and images. It uses the Surrey Audio-Visual Expressed Emotion (SAVEE) Database from the University of Surrey, UK, which contains emotional material in both the speech and facial-image modalities, as the standard source data for recognizing seven emotions (anger, disgust, fear, happiness, neutral, sadness, and surprise). The main research contents are as follows:

1) Speech-based emotion recognition. A total of 92 dimensions of speech emotion features are extracted, consisting of statistical parameters of short-time energy, speech duration, pitch frequency, the first three formants, and Mel-scale Frequency Cepstral Coefficients (MFCC). After features are extracted for all samples, emotion recognition experiments are run on a Support Vector Machine (SVM), and fairly good classification results are obtained.

2) Emotion recognition based on facial images. Two kinds of image emotion features are extracted: the Local Binary Pattern (LBP) of the peak image of each speech segment, and the mean and standard deviation of facial feature points across the image sequence. After features are extracted for all samples, emotion recognition experiments are run on an SVM, and the results obtained with the different features are compared. The feature extraction method based on facial feature points of the image sequence achieves better recognition results than the method based on LBP features of the speech-segment peak image.

3) Multimodal fusion emotion recognition based on speech and images. Feature-level fusion and decision-level fusion are each used to fuse the speech and image modalities, and emotion recognition experiments are run on an SVM. The fused results are compared with the single-modality results, and the feature-level fusion results are compared with the decision-level fusion results. The experiments verify that multimodal emotion recognition based on speech and images performs better than single-modality emotion recognition and that decision-level fusion performs better than feature-level fusion. They also show that decision-level fusion helps improve the recognition rate for fear.
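The two fusion strategies compared in the abstract can be illustrated with a minimal sketch. The abstract does not say which decision rule the thesis uses, so the weighted sum of per-class scores below is an assumption, as are the function names, the default weight, and the made-up scores in the demo. Feature-level fusion is shown as plain concatenation of the per-sample feature vectors, which a single classifier (an SVM in the thesis) would then consume.

```python
from typing import Dict, List, Sequence


def feature_level_fusion(audio_features: Sequence[float],
                         visual_features: Sequence[float]) -> List[float]:
    """Concatenate the two feature vectors so one classifier sees both modalities."""
    return list(audio_features) + list(visual_features)


def decision_level_fusion(audio_scores: Dict[str, float],
                          visual_scores: Dict[str, float],
                          audio_weight: float = 0.5) -> str:
    """Fuse per-class scores from two separately trained classifiers.

    Assumption: a weighted sum of scores followed by argmax. The thesis
    abstract does not specify its decision rule.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_scores.keys() != visual_scores.keys():
        raise ValueError("both classifiers must score the same emotion labels")
    fused = {
        label: audio_weight * audio_scores[label]
        + (1.0 - audio_weight) * visual_scores[label]
        for label in audio_scores
    }
    # Return the emotion label with the highest fused score.
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Made-up scores for one sample; not results from the thesis.
    audio = {"anger": 0.10, "disgust": 0.05, "fear": 0.30, "happiness": 0.05,
             "neutral": 0.05, "sadness": 0.10, "surprise": 0.35}
    visual = {"anger": 0.05, "disgust": 0.10, "fear": 0.40, "happiness": 0.05,
              "neutral": 0.10, "sadness": 0.10, "surprise": 0.20}
    print(decision_level_fusion(audio, visual))
    # 92 speech dimensions as in the abstract; the 10 visual dimensions are hypothetical.
    print(len(feature_level_fusion([0.0] * 92, [0.0] * 10)))
```

In this sketch the speech classifier alone would pick surprise for the demo sample, while the fused scores favor fear, which shows how a second modality can change the decision. A real pipeline would obtain the scores from trained classifiers and would usually normalize each modality's features before concatenating them.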
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TN912.3;TP18

[References]