
Research on Whispered-Speech Speaker Recognition Based on Joint Factor Analysis

Published: 2018-08-01 17:13
【Abstract】: Speaker recognition, an important branch of biometrics, has wide application in public security and forensics, biomedical engineering, military security systems, and other fields. With the rapid development of computer and network technology, speaker recognition has made great progress. Whispering is a special mode of spoken communication used in many settings. Because whispered speech differs considerably from normally phonated speech, speaker recognition for whispered speech cannot simply reuse the methods designed for normal speech, and many problems remain open.
This dissertation presents an in-depth study of text-independent speaker recognition for whispered speech. The main challenges are as follows. First, whispered-speech corpora are immature: for normal speech, the U.S. National Institute of Standards and Technology (NIST) provides standard corpora for speaker recognition research, whereas comparable whispered-speech resources are scarce. Second, feature representation is difficult: because of the peculiarities of whispered phonation, some commonly used feature parameters cannot be extracted, and spectral parameters are harder to obtain than for normal speech. Third, whispered speech is produced with breathy excitation at a low sound level, so it is easily corrupted by noise; it is also often used during mobile-phone calls, so it is easily affected by the channel. Fourth, whispering is constrained by the setting, emotional expression is limited, and the speaker's articulation and psychological state vary, so whispered speech is more strongly influenced by the speaker's psychology, emotion, and speaking state. Compared with normal phonation, then, the main difficulties for whispered-speech speaker recognition are that feature parameters are harder to extract, recognition is easily affected by the speaker's own state, and the system is more sensitive to channel variation.
To address these problems, this dissertation carries out work in the following areas:
1. A parameter extraction algorithm that captures speaker characteristics in whispered speech is proposed. Whispered speech has no fundamental frequency and its voice-source features are hard to exploit, so the reliability of formant extraction, the formants being the parameters that characterize the vocal tract, is especially important. The dissertation proposes a formant extraction algorithm for whispered speech based on spectral segmentation: the spectrum is segmented dynamically, filter parameters are obtained by selective linear prediction, and the formants are obtained by parallel inverse-filter control. This provides an effective way to handle the formant shifting, merging, and flattening caused by whispered phonation. In addition, exploiting the fact that the centroid and flatness statistics of a variable measure signal stability, and combining this with a model of human auditory perception, the dissertation introduces the Bark-subband spectral centroid and the Bark-subband spectral flatness; together with other spectral variables they form a feature set that effectively characterizes speakers in the whispered mode, as illustrated in the sketch that follows.
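To make the Bark-subband features concrete, here is a minimal Python sketch that computes a per-band spectral centroid and spectral flatness for one analysis frame. It is an illustration only, not the dissertation's implementation: the Traunmüller Hz-to-Bark formula, the equal-width band layout, and the band count n_bands are assumptions made for this example.

    import numpy as np

    def hz_to_bark(f):
        # Traunmüller's approximation of the Hz-to-Bark mapping (assumed here).
        return 26.81 * f / (1960.0 + f) - 0.53

    def bark_subband_features(frame, sr, n_fft=512, n_bands=17):
        # Power spectrum of one windowed time-domain frame.
        spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
        bark = hz_to_bark(freqs)
        # Equal-width bands on the Bark axis (a simplification).
        edges = np.linspace(bark[1], bark[-1], n_bands + 1)
        centroids, flatnesses = [], []
        for lo, hi in zip(edges[:-1], edges[1:]):
            idx = (bark >= lo) & (bark < hi)
            p, f = spec[idx], freqs[idx]
            if p.size == 0 or p.sum() <= 0.0:
                centroids.append(0.0)
                flatnesses.append(0.0)
                continue
            # Band centroid: where the band's energy concentrates.
            centroids.append(float((f * p).sum() / p.sum()))
            # Band flatness: geometric mean over arithmetic mean of the spectrum.
            flatnesses.append(float(np.exp(np.mean(np.log(p + 1e-12))) / (p.mean() + 1e-12)))
        return np.array(centroids), np.array(flatnesses)

Low flatness within a band signals peaked, formant-like structure, while the centroid tracks where the band's energy sits; this is why both are plausible stability measures for whispered spectra.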
2. A speaker recognition method for whispered speech under atypical emotion, based on feature mapping and speaker model synthesis, is proposed. It effectively resolves the mismatch between the speaker's emotional state in training and test speech. Because whispered speech conveys emotion less effectively than normal speech, clear-cut emotion classification is infeasible; instead, the speaker's state is classified by arousal and valence (A/V) factors, relaxing the one-to-one correspondence with specific emotions. At test time, as a front-end processing step, the speaker state of each utterance is identified, and compensation is then applied in the feature domain or the model domain; a simplified sketch of the feature-domain idea follows. Experiments show that this state-compensation method not only reflects the distinctive properties of whispered speech but also effectively improves recognition accuracy for whispered speech under atypical emotion.
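The abstract does not give the mapping formulas; classical feature mapping learns a GMM-based transform per condition. As a heavily simplified stand-in, the sketch below fits a per-dimension affine map that moves features of an identified A/V state toward the neutral state by matching means and variances. The function names and the mean/variance form are this example's assumptions, not the dissertation's method.

    import numpy as np

    def fit_state_mapping(neutral_feats, state_feats):
        # neutral_feats, state_feats: (N, D) arrays of frame features from
        # the neutral state and from one A/V state, respectively.
        mu_n, sd_n = neutral_feats.mean(0), neutral_feats.std(0) + 1e-8
        mu_s, sd_s = state_feats.mean(0), state_feats.std(0) + 1e-8
        scale = sd_n / sd_s            # match per-dimension variances
        shift = mu_n - scale * mu_s    # then match per-dimension means
        return scale, shift

    def apply_state_mapping(feats, scale, shift):
        # Applied to a test utterance after its speaker state is identified.
        return feats * scale + shift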
3. A whispered-speech speaker recognition method for atypical emotion based on latent factor analysis is proposed, providing an effective route to speaker-state compensation. Factor analysis does not concern itself with the concrete physical meaning of the common factors; it simply finds representative factors among many variables, and the algorithm's complexity can be tuned by raising or lowering the number of factors. Under the latent-factor model, the whispered-speech feature supervector is decomposed into a speaker supervector and a speaker-state supervector (written out below); the speaker and speaker-state subspaces are estimated from balanced training speech, and at test time the speaker factors of each utterance are estimated and the decision is made from them. Because latent factor analysis avoids explicit speaker-state classification during testing, it further improves the whispered-speech speaker recognition rate over compensation algorithms that depend on such classification.
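In the conventional factor-analysis notation (the symbols are the standard ones, assumed here for illustration; the dissertation's own notation may differ), the decomposition in point 3 reads:

    % Utterance supervector = UBM mean + speaker part + speaker-state part.
    \[
      \mathbf{M} \;=\; \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x}
    \]
    % m: UBM mean supervector; V: low-rank speaker subspace with factors y;
    % U: low-rank speaker-state subspace with factors x.

The joint model of point 4 adds a channel term of the same low-rank form, so that channel and state offsets can both be estimated and removed.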
4. A joint factor analysis (JFA) based method for whispered-speech speaker recognition under atypical emotion and multiple channels is proposed, realizing simultaneous compensation of channel and speaker state. Under JFA, the speech feature supervector is decomposed into a speaker supervector, a speaker-state supervector, and a channel supervector. Because the whispered training data are insufficient to estimate the speaker, speaker-state, and channel subspaces simultaneously, the method first obtains a universal background model (UBM) and computes the Baum-Welch statistics of the speech (see the sketch below), then estimates the speaker subspace, and afterwards estimates the speaker-state and channel subspaces in parallel. At test time, the channel and speaker-state offsets are subtracted from the feature vectors, and the transformed features are used for speaker recognition. Experimental results show that the JFA-based method compensates channel and speaker state simultaneously and achieves better recognition than the other algorithms considered.
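Point 4 relies on Baum-Welch statistics collected against the UBM. These are standard quantities, so a short sketch can show what is computed per utterance; the diagonal-covariance UBM and the centring of the first-order statistics follow common joint-factor-analysis practice and are assumptions of this example rather than details quoted from the dissertation.

    import numpy as np

    def baum_welch_stats(feats, weights, means, covars):
        # feats: (T, D) frames; weights: (C,); means, covars: (C, D) for a
        # diagonal-covariance UBM. Returns zeroth-order N (C,) and
        # mean-centred first-order F (C, D) statistics.
        log_prob = -0.5 * (
            np.sum(np.log(2 * np.pi * covars), axis=1)
            + np.sum((feats[:, None, :] - means) ** 2 / covars, axis=2)
        )
        log_post = np.log(weights) + log_prob
        log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
        gamma = np.exp(log_post)                  # (T, C) frame posteriors
        N = gamma.sum(axis=0)                     # soft frame counts
        F = gamma.T @ feats - N[:, None] * means  # centred first-order stats
        return N, F

Subspace estimation stacks N and F across the training utterances; at test time the estimated channel and speaker-state offsets are subtracted from the features before scoring, as described above.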
【Degree-granting institution】: Soochow University
【Degree level】: Doctoral (PhD)
【Year conferred】: 2014
【CLC number】: TN912.34



