Research on Key Technologies for Cross-Corpus Speech Emotion Recognition (跨库语音情感识别若干关键技术研究)

Published: 2018-01-03 17:18

Title: Research on Key Technologies for Cross-Corpus Speech Emotion Recognition. Source: Southeast University, doctoral dissertation, 2016. Document type: degree thesis.


Keywords: speech emotion recognition; cross-corpus; Student's t-distribution; spectrogram features; selective attention mechanism; deep belief network; feature adaptation


【Abstract】: Speech emotion recognition (SER) is an active research topic in affective computing, pattern recognition, signal processing, and human-computer interaction. Its goal is to classify speech signals by emotion, e.g. "anger", "fear", "disgust", or "happiness". Many effective SER methods have been proposed in recent years, but most studies are conducted on a single speech database. In many practical applications, however, the training corpus differs substantially from the test corpus: the two may come from different languages, speakers, cultures, feature distributions, or data scales. This raises an important research problem, cross-corpus speech emotion recognition. Because SER involves feature extraction, feature selection, classifier design, feature fusion, and other technical components, this dissertation studies the key technologies of cross-corpus SER along those lines. The main contributions are as follows.

1. For emotion feature selection and classification across corpora, a Student-t mixture model with an infinite number of components (iSMM) is proposed, which directly and effectively recognizes multiple kinds of speech emotion samples. Compared with the conventional Gaussian mixture model (GMM), the t-mixture emotion model handles outliers in the sample feature space effectively: first, it stays robust to atypical emotion data at test time; second, to counter the high data complexity of high-dimensional spaces and the shortage of training samples, a global latent space is added to the model. The sample space can then be partitioned into an unbounded number of components, yielding the iSMM emotion model, which determines the optimal component count automatically while keeping complexity low, and thereby classifies diverse emotion feature data. The model is validated on three widely used speech emotion corpora with high-dimensional feature samples and differing spatial distributions, the acted corpora DES and EMO-DB and the spontaneous corpus FAU, testing each model's handling of feature outliers and high-dimensional data as well as its generalization. iSMM maintains more stable recognition performance than the compared models, showing that the proposed infinite-t emotion model is robust to speech data from different sources and has good selection and recognition ability for high-dimensional emotion features contaminated with outliers. (A minimal sketch of the t distribution's robustness mechanism follows.)
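To make the robustness claim concrete, here is a minimal univariate sketch, not the thesis's iSMM (the infinite-component machinery and global latent space are omitted): an EM fit of a Student-t location and scale with fixed degrees of freedom `nu`. The function name, the choice `nu=3`, and the synthetic data are illustrative assumptions. The E-step weights `w` shrink the influence of outlying samples, which a Gaussian fit lacks.

```python
import numpy as np

def fit_student_t(x, nu=3.0, n_iter=50):
    """EM for the location mu and scale sigma of a Student-t with fixed nu.

    Outliers receive small E-step weights w, so they barely move mu;
    this downweighting is the robustness property the iSMM builds on.
    """
    mu, sigma2 = np.mean(x), np.var(x)
    for _ in range(n_iter):
        # E-step: expected precision-scaling weight per sample.
        w = (nu + 1.0) / (nu + (x - mu) ** 2 / sigma2)
        # M-step: weighted location update, weight-corrected scale update.
        mu = np.sum(w * x) / np.sum(w)
        sigma2 = np.sum(w * (x - mu) ** 2) / len(x)
    return mu, np.sqrt(sigma2)

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, 500)
data = np.concatenate([clean, [25.0, 30.0, -40.0]])  # inject gross outliers

mu_t, sigma_t = fit_student_t(data)
print("Gaussian fit:  mu=%.2f sigma=%.2f" % (data.mean(), data.std()))
print("Student-t fit: mu=%.2f sigma=%.2f" % (mu_t, sigma_t))
```

On this data the Gaussian estimates are dragged toward the outliers while the t estimates stay near the clean distribution, which is the behavior the abstract attributes to the t-mixture model at test time.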
2. Combining K-nearest neighbors, kernel learning, the feature-line centroid method, and LDA, an LDA+kernel-KNNFLC method for emotion recognition is proposed. To curb the heavy computation caused by the large number of prior sample features, the centroid criterion is used to learn sample distances, improving the kernel KNN method; LDA then optimizes the emotion feature vectors, avoiding dimensional redundancy while better stabilizing recognition of between-class emotion information. For the cross-corpus setting, the method targets the recognition performance differences caused by overly tight fitting of class boundaries within a single database: by relearning the feature space, the proposed classifier improves the between-class separability of emotion feature vectors, making it suitable for emotion features from different corpus sources. Simulation experiments on two speech emotion databases with high-dimensional global statistical features, comparing dimensionality-reduction schemes, emotion classifiers, and dimension parameters, show that LDA+kernel-KNNFLC significantly improves recognition under identical conditions and classifies emotion categories with relative stability. (A simplified LDA-plus-KNN stand-in is sketched below.)
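As a simplified stand-in for this pipeline, assuming scikit-learn, the sketch below projects high-dimensional statistical features onto a discriminative LDA subspace and classifies with plain KNN; the kernel feature-line-centroid distance of the actual method is replaced by ordinary Euclidean KNN, and the corpora are simulated with synthetic data, so every dimension and parameter here is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for high-dimensional global statistical features
# (openSMILE-style functionals) over 4 emotion classes.
X, y = make_classification(n_samples=600, n_features=384, n_informative=40,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# LDA keeps at most n_classes - 1 discriminative directions,
# removing the dimensional redundancy the abstract mentions.
clf = Pipeline([
    ("lda", LinearDiscriminantAnalysis(n_components=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
clf.fit(X_tr, y_tr)
print("accuracy: %.3f" % clf.score(X_te, y_te))
```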
3. To improve (extend) the emotion feature categories available under cross-corpus conditions, a spectrogram feature extraction method based on an auditory attention model is proposed. The model simulates human auditory characteristics and effectively detects emotion-related variation in the spectrogram; it is further improved with time-frequency atoms, exploiting their frequency-matching properties to extract emotional information in the time domain. In SER, noise environments, speaking styles, and speaker traits cause mismatched feature space distributions, and phonetic analysis shows this problem is concentrated in cross-corpus emotion recognition tasks: mismatch between the trained acoustic model and the test utterances makes recognition performance drop sharply. Spectrogram features effectively complement existing emotion features from the image perspective, and the auditory attention mechanism lets the model extract features that remain salient across speech databases, improving the emotion discrimination ability of the SER system. In the simulation experiments, features extracted by the proposed method from cross-corpus emotion samples are classified with typical classifiers; compared with the internationally standard baseline, the spectrogram emotion features raise recognition by about 9 percentage points, confirming better robustness across databases. (A crude center-surround saliency sketch follows.)
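The sketch below illustrates the spectrogram-as-image idea under stated assumptions: it computes a log spectrogram with SciPy and a crude Itti-style intensity saliency map via center-surround differences of Gaussian-blurred copies. The actual STB/Itti model also uses color and orientation channels plus the time-frequency-atom refinement described above; the chirp test signal and the blur scales are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import spectrogram

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
# Toy utterance stand-in: a chirp plus a weak steady harmonic.
x = np.sin(2 * np.pi * (200 * t + 300 * t ** 2)) \
    + 0.3 * np.sin(2 * np.pi * 800 * t)

freqs, frames, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
S = np.log(Sxx + 1e-10)  # log-magnitude "image" of the utterance

# Center-surround: fine minus coarse Gaussian blur approximates the
# intensity channel of an Itti-style saliency map.
center = gaussian_filter(S, sigma=1)
surround = gaussian_filter(S, sigma=8)
saliency = np.abs(center - surround)

# Pool the saliency map into a fixed-length vector (per-band means),
# a crude stand-in for attention-selected spectrogram features.
features = saliency.mean(axis=1)
print("spectrogram shape:", S.shape, "feature dim:", features.shape[0])
```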
4. Using deep belief networks from deep learning, a DBN-based feature-level fusion method is proposed that treats the emotional information implicit in the speech spectrogram as image features and fuses it with traditional acoustic emotion features, solving the cross-corpus difficulty of combining emotion features extracted at different scales. The STB/Itti model analyzes the spectrogram, extracting spectrogram features along the color, intensity, and orientation channels; an improved DBN then fuses these with the traditional acoustic features at the feature level, enlarging the feature subset and strengthening its emotion-representation capacity. Experiments on the ABC database and several Chinese databases verify that the fused feature subset clearly improves cross-corpus recognition over traditional speech emotion features. (A single-layer stand-in for the fusion step is sketched after item 5.)

5. The feature adaptation problem of SER models caused by the use of different languages and large numbers of unknown speakers under cross-corpus conditions is studied. Building on the preceding chapters, adaptation is investigated with respect to feature parameter distortion, spectrogram feature construction, comparison of modeling algorithms, and online optimization, with comparative performance analysis. Existing SER adaptation methods are reviewed first; adaptation to additive speaker-dependent feature distortion in the cross-corpus case is then studied further and a model scheme is given. Next, to assess how multi-speaker adaptation affects an SER system, the process is modeled and the Gaussian mixture model is contrasted with the Student-t model as competing statistical approaches; the respective adaptation schemes are used to obtain feature function sets that include the spectrogram features, and a small amount of online data is used to optimize the feature functions quickly. Finally, experiments on databases in four languages (German, English, Chinese, and Vietnamese) verify the effectiveness of each adaptation scheme. The results show that the improved adaptation scheme adapts speaker features well, and in particular transfers model parameters well when many unknown speakers are present. The influence of the different languages on emotional characteristics across databases is also analyzed and discussed from the feature-adaptation angle.
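As a single-layer stand-in for the DBN fusion of item 4 (a full DBN stacks several RBMs with greedy layer-wise pretraining), the fragment below, assuming scikit-learn, concatenates two synthetic feature views standing in for acoustic functionals and spectrogram saliency statistics, rescales them to [0, 1], and lets a BernoulliRBM learn a joint hidden representation for a logistic-regression classifier. All dimensions and hyperparameters are assumptions, not values from the thesis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Two synthetic "views" of the same 4-class utterances:
# acoustic-style functionals and spectrogram-style statistics.
X_ac, y = make_classification(n_samples=800, n_features=120, n_informative=30,
                              n_classes=4, n_clusters_per_class=1,
                              random_state=1)
X_sp = X_ac[:, :40] + 0.5 * rng.normal(size=(800, 40))  # correlated view
X = np.hstack([X_ac, X_sp])  # feature-level fusion by concatenation

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

fusion = Pipeline([
    ("scale", MinMaxScaler()),   # BernoulliRBM expects inputs in [0, 1]
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05,
                         n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
fusion.fit(X_tr, y_tr)
print("fused-feature accuracy: %.3f" % fusion.score(X_te, y_te))
```

The design point this illustrates is early (feature-level) fusion: both views enter one joint hidden layer, so cross-view correlations can shape the learned representation, unlike decision-level fusion of separate classifiers.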

【Degree-granting institution】: Southeast University
【Degree level】: Doctorate
【Year conferred】: 2016
【CLC classification】: TN912.34


Link: https://www.wllwen.com/shoufeilunwen/xxkjbs/1374844.html

