Research on Speaker Adaptation Methods for RNN-BLSTM Acoustic Models

Published: 2018-06-18 16:22

Topics: speech recognition + speaker adaptation; Source: master's thesis, University of Science and Technology of China, 2017

[Abstract]: Speaker adaptation uses speech data provided by a specific speaker to substantially improve a speech recognition system's performance for that speaker. It can transform a speaker-independent recognition system into a speaker-dependent one, so that the system matches the speaker-dependent acoustic features; or it can transform speaker-dependent acoustic features into speaker-independent ones, so that the features match the speaker-independent system. In both cases the goal is to make the speaker and the recognition system match as closely as possible. The acoustic model based on a recurrent neural network with bidirectional Long Short-Term Memory (RNN-BLSTM) not only models the temporal structure of speech but also uses several gates to control the flow of information, which resolves the exploding and vanishing gradient problems of conventional recurrent neural network acoustic models. On several standard speech datasets, recognition systems based on the RNN-BLSTM acoustic model have achieved a performance improvement of more than 10% over deep neural networks (DNN). Although the RNN-BLSTM acoustic model recognizes speech far better than the DNN, it still cannot resolve the mismatch described above, so studying speaker adaptation for the RNN-BLSTM acoustic model is particularly important.
This thesis focuses on speaker adaptation for the RNN-BLSTM acoustic model. First, it applies the speaker-code-based adaptation method to the RNN-BLSTM acoustic model and analyzes how the different gates in the RNN-BLSTM memory cell affect recognition performance after adaptation. It also proposes several heuristic algorithms that optimize and improve the speaker-code-based method and further raise recognition performance.
Next, the thesis proposes an offline speaker adaptation method based on a deep code (d-code), which offers a way around the two-pass decoding required by the speaker-code-based method. In comparative experiments, this method comes close to the speaker-code-based method in recognition performance. It also outperforms the identity vector (i-vector) based method, which likewise needs no two-pass decoding, and its training procedure is more flexible. Finally, the thesis studies an online speaker adaptation method based on d-code, which does not require collecting an entire utterance from the speaker. It adapts to the speaker incrementally during online recognition and achieves good recognition results.
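The abstract does not give the equations used to connect a speaker code to the memory cell, so the following is only a minimal sketch under stated assumptions: the speaker code enters each selected gate as an extra additive term through a hypothetical weight matrix `V`, and a single unidirectional time step is shown for brevity. All names (`init_params`, `lstm_step`, `adapt`, `V`) are illustrative and not taken from the thesis.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


GATES = ("i", "f", "o", "g")  # input, forget, output gates and cell candidate


def init_params(rng, n_in, n_hid, n_code):
    """Random toy parameters; V holds the hypothetical speaker-code weights."""
    return {
        "W": {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in GATES},
        "U": {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in GATES},
        "b": {k: np.zeros(n_hid) for k in GATES},
        "V": {k: rng.standard_normal((n_hid, n_code)) * 0.1 for k in GATES},
    }


def lstm_step(x, h_prev, c_prev, p, spk_code=None, adapt=("i", "f", "o")):
    """One LSTM time step with an optional speaker-code term on chosen gates."""
    pre = {}
    for k in GATES:
        pre[k] = p["W"][k] @ x + p["U"][k] @ h_prev + p["b"][k]
        if spk_code is not None and k in adapt:
            # Assumption: the code acts as a speaker-dependent additive bias.
            pre[k] = pre[k] + p["V"][k] @ spk_code
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = np.tanh(pre["g"])
    c = f * c_prev + i * g   # forget gate scales the old cell state
    h = o * np.tanh(c)       # output gate scales the exposed state
    return h, c


rng = np.random.default_rng(0)
p = init_params(rng, n_in=4, n_hid=3, n_code=2)
x, h0, c0 = rng.standard_normal(4), np.zeros(3), np.ones(3)
code = rng.standard_normal(2)

h_si, _ = lstm_step(x, h0, c0, p)                     # speaker-independent
h_f, _ = lstm_step(x, h0, c0, p, code, adapt=("f",))  # adapt forget gate only
print(np.abs(h_f - h_si).max())
```

Changing the `adapt` tuple is one way to probe which gates respond most to the speaker code. A bidirectional model would run the same step over the sequence once forward and once backward and combine the two hidden states.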
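[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TN912.34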

[Similar literature]