A Study of Training Targets in Supervised Speech Separation

Published: 2018-04-26 03:05

Topic: deep neural networks + speech separation; Source: Inner Mongolia University, master's thesis, 2017

[Abstract]: Speech separation is the task of extracting a desired target speech signal from a noisy mixture. It is used in robust speech recognition, hearing aid design, mobile voice communication, and related fields. The performance of current speech separation techniques in real-world conditions still leaves room for improvement. Speech separation is divided by channel count into single-channel and multi-channel separation; this thesis focuses on the single-channel problem. Speech separation can be treated as a supervised learning problem and solved with supervised learning algorithms. In supervised speech separation, the training target is one of the key components and has a significant effect on separation performance. The most commonly used training targets today are the ideal binary mask (IBM) and the ideal ratio mask (IRM). Both are derived under the assumption that clean speech and noise are independent of each other, an assumption that is hard to satisfy in real-world conditions. The complex ideal ratio mask (cIRM) and the phase-sensitive mask (PSM) take the phase of the speech signal into account, but they are difficult to estimate, so their separation performance in practice is still unsatisfactory. In contrast to these common time-frequency masks, the optimal ratio mask (ORM) adopted in this thesis accounts for the correlation between clean speech and noise, which matches the conditions of speech separation in real-world scenarios. This thesis combines the ORM with supervised speech separation, using the ORM as the separation target, and proposes a new approach to the speech separation problem. Simulation experiments were carried out under a variety of noise types and signal-to-noise ratio (SNR) conditions and compared against several commonly used training targets. The results show that the proposed method further improves separation quality and is better suited to speech separation in real-world scenarios. Because the ORM is built on correlation information between clean speech and noise, the thesis also runs simulation experiments on the more challenging task of separating the speech of different speakers. The results show that the proposed method retains its performance advantage for speaker separation. Single-channel speech dereverberation is another research focus in speech signal processing. In recent years, as deep learning has advanced, researchers have applied it to speech dereverberation with good results. This thesis applies the proposed speech separation method to speech dereverberation, and the experimental results show that dereverberation performance improves to some extent.
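The abstract names its training targets without giving formulas. As a rough illustration only, the sketch below computes three of them from the complex short-time spectra of clean speech `S` and noise `N`. The definitions are the ones commonly given in the speech separation literature and are an assumption here, not taken from the thesis; the function names, the 0 dB local criterion, and the exponent `beta` are likewise illustrative choices.

```python
import numpy as np

EPS = np.finfo(float).eps  # guards against division by zero in silent bins


def ideal_binary_mask(S, N, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion lc_db, else 0 (assumed definition)."""
    snr_db = 10.0 * np.log10((np.abs(S) ** 2 + EPS) / (np.abs(N) ** 2 + EPS))
    return (snr_db > lc_db).astype(float)


def ideal_ratio_mask(S, N, beta=0.5):
    """IRM: speech-to-mixture energy ratio, ignoring any speech-noise cross term."""
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    return (ps / (ps + pn + EPS)) ** beta


def optimal_ratio_mask(S, N):
    """ORM: like the IRM, but keeps the cross term Re(S * conj(N)) that
    measures speech-noise correlation in each time-frequency bin (assumed definition)."""
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    cross = np.real(S * np.conj(N))
    return (ps + cross) / (ps + pn + 2.0 * cross + EPS)
```

Under these assumed definitions, the ORM reduces to the IRM with `beta=1` in any bin where the cross term is zero, which is one way to read the abstract's point that the IBM and IRM presuppose independent speech and noise.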
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TN912.3

[Similar literature]