Research on a Robust Pitch Detection Method Based on Convolutional Neural Networks

Published: 2018-10-26 14:31

[Abstract]: Speech is the carrier of information in the symbolic system of language and the most widely used medium of everyday communication. Pitch is a key feature of the speech signal and plays an irreplaceable role; it is widely used in speech synthesis, speech recognition, and related fields. Accurate and efficient pitch extraction directly affects the accuracy of speech recognition, the naturalness of speech synthesis, and the clarity of speech separation. Pitch extraction from clean speech already achieves good results. In noisy environments, however, the harmonic structure is severely corrupted, so detecting the pitch of noisy speech remains difficult. This thesis proposes using a convolutional neural network (CNN) for this task. A CNN is shift-invariant: as its convolution kernels slide across the input, they can better capture the harmonic structure in the spectrogram. In the implementation, the CNN selects pitch candidates; then, to exploit the continuity of the speech signal, dynamic programming (DP) tracks the pitch and produces a continuous pitch contour. Comparative experiments with different methods were run on the same dataset. The results show that the proposed method has a clear performance advantage, achieving a higher pitch detection rate (DR) and a lower voicing decision error (VDE). Compared with a deep neural network (DNN), the nonlinear amplitude compression method (hereafter 'PEFAC'), and the method of Jin and Wang (hereafter 'Jin'), the proposed method raises DR by an average of 5.58%, 5.75%, and 16.41%, respectively, and lowers VDE by 1.91%, 4.25%, and 10.04%, respectively. The method generalizes well to new speakers and new noise types and is more robust. Moreover, as the similarity between the test set and the training set decreases, the advantage of the proposed method becomes increasingly pronounced.
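The two-stage idea in the abstract (per-frame candidate scores, then dynamic programming to enforce continuity) can be illustrated with a generic Viterbi-style tracker. This is only a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not specify the state layout or the cost function, so the `track_pitch` function, the convention that state 0 means "unvoiced", and the `jump_penalty` and `voicing_penalty` parameters are all hypothetical. The input stands in for whatever per-frame scores a CNN would produce.

```python
import math


def track_pitch(posteriors, jump_penalty=0.5, voicing_penalty=2.0):
    """Viterbi-style dynamic programming over per-frame pitch-state scores.

    posteriors[t][s] is an assumed per-frame probability of state s, where
    state 0 means "unvoiced" and states 1..K are quantized pitch bins.
    Both penalties are illustrative values, not taken from the thesis.
    """
    n_states = len(posteriors[0])
    eps = 1e-12  # avoids log(0)

    def transition_cost(prev, cur):
        if prev == cur:
            return 0.0
        if prev == 0 or cur == 0:
            return voicing_penalty  # switching between voiced and unvoiced
        return jump_penalty * abs(cur - prev)  # larger pitch jumps cost more

    # Best log-score of any path ending in each state at the first frame.
    score = [math.log(p + eps) for p in posteriors[0]]
    backptr = []

    for frame in posteriors[1:]:
        new_score, ptr = [], []
        for cur in range(n_states):
            best_prev = max(
                range(n_states),
                key=lambda prev: score[prev] - transition_cost(prev, cur),
            )
            ptr.append(best_prev)
            new_score.append(
                score[best_prev]
                - transition_cost(best_prev, cur)
                + math.log(frame[cur] + eps)
            )
        score = new_score
        backptr.append(ptr)

    # Backtrack from the best final state to recover the full contour.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Frame-wise argmax would pick states [2, 3, 2]; the tracker smooths the
# weakly supported jump in the middle frame.
frames = [
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.40, 0.50],
    [0.05, 0.05, 0.85, 0.05],
]
print(track_pitch(frames))  # → [2, 2, 2]
```

The design choice shown here is the trade-off between trusting each frame's score and penalizing discontinuity: raising `jump_penalty` yields smoother contours, while lowering it lets the path follow frame-wise maxima more closely.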
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TN912.3;TP183

[Similar literature]