Speaker Feature Extraction with Phoneme Effect Suppression Based on Speech Frequency Characteristics

Published: 2018-03-24 10:17

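Topic: speaker identification　Entry point: distribution of speaker-specific information across phonemes　Source: Tianjin University, doctoral dissertation, 2014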

[Abstract]: Speech carries both linguistic information and speaker-specific information: linguistic information reflects what is common across speakers, while speaker-specific information reflects what is individual to each speaker. Speaker recognition needs to preserve the speaker-specific information while suppressing the linguistic information, yet the two are difficult to separate in a speech signal. To reduce the effect of differences in spoken content on speaker recognition, this dissertation proposes the Phoneme Effect Suppression (PES) method, which emphasizes the differences in speaker-specific information. To obtain an accurate distribution of speaker information in the frequency domain, the dissertation first studies the frequency characteristics of speech. By computing each phoneme's contribution to speaker-specific information in every frequency sub-band (Phoneme F-ratio Contribution, PFC), it derives the distribution of speaker information for different phonemes. Speech is shaped by a person's vocal organs, manner of articulation, and place of articulation, so the distribution of speaker information within each phoneme reflects the individuality of the particular articulatory organs and manner of articulation involved. The acoustic expression of speaker-specific information is studied separately in three languages (English, Chinese, and Korean). Measuring each phoneme's per-sub-band contribution to speaker-specific information shows that voiced sounds, unvoiced sounds, and nasals each have a different distribution of speaker-specific information.
On this basis, the dissertation proposes the PES method, which suppresses the effect of different phonemes on speaker individuality and yields the distribution of speaker-specific information in the frequency domain (Phoneme Effect Suppressed Speaker Information Distribution, PES-SID). Finally, it proposes a new speaker feature extraction method built on a non-uniform frequency scale derived from this distribution. The proposed feature is used with a GMM speaker model in speaker identification experiments and compared with two other speaker features. The results show that the proposed feature outperforms both. Compared with MFCC (Mel Frequency Cepstrum Coefficient) features, it reduces the identification error rate in every language tested: by 61.1% for English, 68.0% for Korean, and 32.9% for Chinese. Compared with FFCC (F-ratio Frequency Cepstrum Coefficient) features, the error rate is reduced by 30% (English), 28.5% (Korean), and 6.6% (Chinese). These results indicate that the proposed feature also provides a degree of robustness in speaker discrimination across languages.
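The abstract ranks frequency sub-bands by an F-ratio but does not give the formula. As a rough illustration only, the sketch below uses the conventional F-ratio (variance of the per-speaker means divided by the average within-speaker variance) on per-frame sub-band log energies. The function name `f_ratio_per_band`, the input layout, and the toy data are assumptions made for this example; this is not the dissertation's PFC or PES-SID computation, which additionally conditions on phonemes.

```python
from statistics import fmean, pvariance


def f_ratio_per_band(energies_by_speaker):
    """Conventional per-band F-ratio (illustrative, not the thesis's PFC).

    energies_by_speaker maps a speaker id to a list of frames; each frame
    is a list of sub-band log energies (hypothetical input layout).
    """
    speakers = list(energies_by_speaker)
    n_bands = len(energies_by_speaker[speakers[0]][0])
    ratios = []
    for k in range(n_bands):
        # Collect band-k values separately for every speaker.
        per_speaker = [
            [frame[k] for frame in energies_by_speaker[s]] for s in speakers
        ]
        means = [fmean(values) for values in per_speaker]
        # Between-speaker variance: spread of the speaker means.
        between = pvariance(means)
        # Within-speaker variance: average spread inside each speaker.
        within = fmean([pvariance(values) for values in per_speaker])
        ratios.append(between / within if within > 0 else float("inf"))
    return ratios


# Toy data: band 0 separates the two speakers, band 1 does not.
toy = {
    "spk_a": [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0]],
    "spk_b": [[3.0, 4.5], [3.2, 3.5], [2.8, 4.0]],
}
ratios = f_ratio_per_band(toy)
print(ratios)
```

A band with a high ratio varies more between speakers than within a speaker, which is the intuition behind allocating finer frequency resolution to such bands when designing a non-uniform frequency scale.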
[Degree-granting institution]: Tianjin University
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TN912.3

[References]