Binaural Speech Separation Based on Auditory Computational Models and Deep Neural Networks

Published: 2018-03-24 03:14

Topic: binaural speech separation　Focus: regression neural network　Source: University of Science and Technology of China, master's thesis, 2017

[Abstract]: Speech is one of the most important ways people communicate. Because of noise in everyday environments, loss during channel transmission, and other factors, speech quality is often degraded and the information carried by the received speech is substantially reduced. Recovering clean speech from noisy speech therefore matters in daily life, and speech separation has become an important research direction in speech signal processing. Over the past few decades, traditional speech separation methods such as spectral subtraction and Wiener filtering have been studied extensively. However, these methods rely on assumptions about the characteristics of speech and interference that may not hold in practice, which limits their effectiveness in real applications; for example, the separated speech can contain "musical noise". In recent years, auditory scene analysis has attracted growing attention. Inspired by the human auditory system, it separates speech by extracting effective "scene cues" from the signal, and research on performing this analysis and separation in software is flourishing. Current auditory scene analysis methods based on classification neural networks can effectively raise the signal-to-noise ratio of the separated speech, but they do not adequately preserve its perceptual quality, leaving discontinuities in the speech.
To address this, this thesis focuses on three questions: how to use deep neural networks for speech separation while reducing the unnatural perceptual quality of the output; how to extract effective "scene cues" from binaural speech signals, based on computational auditory scene analysis theory, so as to improve separation performance in noisy environments; and how to use computational models of human hearing to extract features at the level of auditory cortical receptive fields that mimic human auditory characteristics, thereby improving separation.
First, we propose a binaural speech separation method based on a regression neural network. Unlike classification neural networks, which classify and regroup time-frequency units, we use the strong information extraction and modeling capacity of neural networks to estimate the clean target speech directly from the noisy input. Through the choice of the network's learning target and the minimum mean squared error criterion, the estimated speech features retain good continuity and naturalness in both the time and frequency domains. Experimental results show that the regression-based method substantially improves the perceptual quality of the separated speech. Second, building on the regression model and on auditory scene analysis theory, we propose a two-channel feature representation based on log-power spectra. On top of conventional log-power spectral features, and in view of the characteristics of binaural information, we design a full-band inter-channel energy difference feature defined per frequency bin and time frame, and a low-dimensional global inter-channel energy difference feature. So that the features carry enough information without introducing too many parameters through excessive dimensionality, we also design a sub-band inter-channel energy difference feature. Experimental results show that the proposed two-channel energy difference features make effective use of binaural information and improve separation, and that the system based on the sub-band feature performs better.
Finally, drawing on the study of auditory computational models, we propose a speech separation method based on auditory cortical spectro-temporal receptive field features. Building on existing mathematical models, we design two-dimensional filters for binaural speech that mimic spectro-temporal receptive field characteristics. To address the excessive dimensionality of these features, we propose and apply several dimensionality reduction methods, such as frequency-domain averaging and principal component analysis in the single-channel case. When extracting two-channel "cues", we design spectro-temporal receptive field energy difference features and reduce their dimension with global weighted sums and partitioned weighted sums, so that the two-channel features are optimal in their combination of scales. We also design a per-band weighted sum so that the features are optimal both in scale combination and across frequency bands. By letting the model learn the weighting coefficients, we obtain an effective set of reduced-dimension spectro-temporal receptive field energy difference features. Experimental results show that automatically learned feature combinations improve the model's separation performance more effectively.
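
The three inter-channel energy difference features named in the abstract (full-band, global, and sub-band) can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names `power_spectrum` and `binaural_features`, the frame length, hop size, and number of sub-bands, the use of equal-width sub-bands, and the definition of each feature as a left-minus-right difference of log energies. It is not the thesis's actual feature definition.

```python
import numpy as np

def power_spectrum(x, frame_len=512, hop=256):
    """Short-time power spectrum, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def binaural_features(left, right, n_bands=8, eps=1e-10):
    """Hypothetical left/right log-energy difference features.

    Returns (full_band, global_, sub_band):
      full_band: one value per frame and frequency bin
      global_:   one value per frame (all bins pooled)
      sub_band:  one value per frame and sub-band (assumed equal-width bands)
    """
    p_left, p_right = power_spectrum(left), power_spectrum(right)

    # Full-band: difference of the two log-power spectra, bin by bin.
    full_band = np.log(p_left + eps) - np.log(p_right + eps)

    # Global: pool the energy over all bins before taking the log difference.
    global_ = np.log(p_left.sum(axis=1) + eps) - np.log(p_right.sum(axis=1) + eps)

    # Sub-band: pool the energy inside each band, then take the log difference.
    bands_left = np.array_split(p_left, n_bands, axis=1)
    bands_right = np.array_split(p_right, n_bands, axis=1)
    sub_band = np.stack(
        [np.log(bl.sum(axis=1) + eps) - np.log(br.sum(axis=1) + eps)
         for bl, br in zip(bands_left, bands_right)],
        axis=1,
    )
    return full_band, global_, sub_band
```

Under these assumptions the sub-band feature sits between the other two in size: with 512-sample frames it has 8 values per frame, against 257 for the full-band feature and 1 for the global one, which matches the abstract's stated motivation of keeping enough information without inflating the input dimension.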

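[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TN912.3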
