Research on Robust Speaker Recognition Methods Based on Sparse Coding

Published: 2018-07-26 10:50
【Abstract】: Speaker recognition, also known as voiceprint recognition, is a technology that determines a speaker's identity from speech. Because speech is easy to collect and inexpensive to capture, speaker recognition is widely used in biometric authentication, security monitoring, military reconnaissance, and financial transactions, and has broad application prospects. Over the past decades, research institutions and companies around the world have invested substantial human and material resources in this research, which has strongly advanced speaker recognition technology. The technology is now gradually moving from the laboratory into practical use, and the complexity of real environments places higher demands on it, including robustness, real-time performance, recognition rate, and stability. Meeting these demands requires breakthroughs in the key stages of speaker recognition, especially voice activity detection, feature extraction, and speaker model construction. Current speaker recognition techniques achieve satisfactory recognition rates on clean speech, but their performance degrades sharply in noisy environments, which hinders practical deployment. To address this lack of noise robustness, this thesis applies sparse coding to each stage of speaker recognition, including voice activity detection, speech feature extraction, and speaker modeling, and proposes a systematic solution for improving the recognition rate of speaker recognition systems in noisy environments. The main work is as follows.
First, the noise modeling capability of two sparse coding approaches is analyzed theoretically, which lays the foundation for applying sparse coding. Sparse coding can model noise in two ways. The first models the noise with the residual. Its theoretical noise model is Gaussian white noise, and its underlying assumption is that speech is sparse over the speech dictionary while noise is not; white noise is not sparse over any dictionary, so it satisfies this requirement. The second uses a noise dictionary to model the noise. Its underlying assumption is that speech and noise are each sparse over their own dictionary, and sparser over their own dictionary than over the other's. This thesis theoretically analyzes the upper and lower bounds of the signal reconstruction error for both approaches and then verifies the conclusions experimentally. The results show that when the noise is not sparse, the reconstruction errors of the two methods have the same theoretical lower bound but different upper bounds. When the noise may also be sparse, the second method, which adds a dictionary to model the noise and so incorporates more prior knowledge, has a lower upper bound on reconstruction error than the first.
Second, to address the sensitivity of voice activity detection to noise, a noise-robust voice activity detection method is proposed that builds noise dictionaries with sparse coding. Voice activity detection is the first step of speaker recognition; it reduces the amount of data the algorithm must process and improves recognition efficiency. Existing voice activity detection methods do take noise into account, but they only handle cases where the noise environment is known and unchanging. When the noise environment changes, or the noise is non-stationary, their performance drops sharply. The proposed method first uses a Gaussian mixture model to identify the noise type. It then concatenates the trained noise dictionary with the speech dictionary to form one large dictionary. Finally, it sparsely represents the noisy speech over the concatenated dictionary and uses the sparse representation over the speech dictionary to decide between speech and non-speech. The results show that the method perceives the noise environment, selects a dictionary targeted to the noise, and achieves better recognition results in complex noise environments.
Third, two noise-insensitive feature extraction methods are proposed. Feature extraction is a key stage of speaker recognition. On the one hand, features must be discriminative; on the other hand, they should be disturbed by noise as little as possible. The first proposed feature uses the perceptual minimum variance distortionless response (PMVDR) technique together with the shifted delta cepstrum algorithm, which effectively incorporates long-term information from the speaker's speech. The extracted feature performs well in clean conditions and also outperforms current mainstream features under acoustic conditions such as noisy speech and channel mismatch. Experiments on the YOHO database and the ROSSI database show that this feature effectively improves the robustness of the recognition system under noise and channel distortion. The second feature decomposes the noisy speech over the speech dictionary, reconstructs the speech from the sparse representation, and extracts Mel cepstral features for model training and recognition. Because sparse coding can model the noise with either the residual or a noise dictionary, the reconstructed signal is free of noise, so speech features that are only slightly affected by noise can be extracted.
Finally, a speaker recognition framework based on two-stage sparse decomposition is proposed. Existing speaker recognition methods generally concatenate all speaker dictionaries into one large dictionary. Although this is discriminative to some degree, it has two problems. First, the concatenated dictionary contains too many atoms, which reduces recognition efficiency. Second, too many classes compete, which dilutes the competitiveness of the true speaker. In the first stage, the proposed method decomposes the test utterance over each speaker dictionary, computes the residual from the reconstruction, sorts the residuals, and selects a subset of dictionaries that contains the true speaker. In the second stage, it concatenates the dictionaries in this subset into one large dictionary, decomposes the test utterance over it again, and identifies the speaker from scores computed from the sparse representation over each dictionary. This structure removes a large number of irrelevant speaker dictionaries in the first stage, which reduces the time complexity of the algorithm, and uses a discriminative recognition method in the second stage, which preserves the recognition rate. Experiments show that the proposed two-stage sparse decomposition method improves both recognition speed and accuracy.
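The two-stage sparse decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not say which sparse solver or scoring rule the thesis uses, so several parts here are assumptions made purely for illustration: a small hand-written orthogonal matching pursuit (`omp`) as the solver, the l1 mass of each speaker's coefficient block as the second-stage score, the hypothetical function name `two_stage_identify`, its parameters `k` (atoms per decomposition) and `keep` (size of the shortlist), and the toy dictionaries built from standard basis vectors.

```python
import numpy as np


def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y with at most k atoms of D."""
    residual = y.copy()
    support = []
    sol = np.zeros(0)
    coef = np.zeros(D.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx in support:
            break  # no new atom helps; stop early
        support.append(idx)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef


def two_stage_identify(dictionaries, y, k=3, keep=2):
    """Hypothetical two-stage identification; returns (best, shortlist, scores)."""
    # Stage 1: decompose y over each speaker dictionary separately and
    # rank the speakers by reconstruction residual (smaller is better).
    residuals = []
    for D in dictionaries:
        x = omp(D, y, k)
        residuals.append(np.linalg.norm(y - D @ x))
    shortlist = np.argsort(residuals)[:keep]

    # Stage 2: concatenate only the shortlisted dictionaries and decompose again.
    big = np.hstack([dictionaries[i] for i in shortlist])
    x = omp(big, y, k)

    # Assumed scoring rule: l1 mass of the coefficients in each speaker's block.
    scores = {}
    start = 0
    for i in shortlist:
        width = dictionaries[i].shape[1]
        scores[int(i)] = float(np.abs(x[start:start + width]).sum())
        start += width
    best = max(scores, key=scores.get)
    return best, shortlist.tolist(), scores


# Toy data: three "speakers", each owning two standard basis vectors as atoms.
eye = np.eye(6)
toy_dictionaries = [eye[:, [0, 1]], eye[:, [2, 3]], eye[:, [4, 5]]]
# The test vector lies mostly in speaker 1's subspace, plus small leakage.
toy_y = np.array([0.1, 0.0, 2.0, 1.0, 0.05, 0.0])
best, shortlist, scores = two_stage_identify(toy_dictionaries, toy_y, k=2, keep=2)
print(best, shortlist, scores)
```

In this sketch the first stage costs one small decomposition per speaker, and the second stage works on a dictionary of only `keep` speakers instead of all of them, which mirrors the efficiency argument in the abstract. Real dictionaries would be learned from speech features rather than built from basis vectors.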
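【Degree-granting institution】: Harbin University of Science and Technology
【Degree level】: Doctoral
【Year degree conferred】: 2016
【Classification number】: TN912.34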

Article ID: 2145770

Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2145770.html

