
Published: 2018-07-26 10:50
[Abstract]: Speaker recognition, also known as voiceprint recognition, is a technology that determines a speaker's identity from speech. Because speech is convenient to acquire and inexpensive to process, speaker recognition has been widely used in biometric authentication, security monitoring, military investigation, and financial interaction. Research institutions and companies have invested substantial human and material resources in advancing the technology. Speaker recognition is now gradually moving from the laboratory into practical use, and the complexity of real environments places higher demands on it, including robustness, real-time performance, recognition rate, and stability. Meeting these demands requires breakthroughs in the key stages of speaker recognition, especially voice activity detection, feature extraction, and speaker modeling. Current speaker recognition technology achieves a satisfactory recognition rate on clean speech, but its performance drops sharply in noisy environments, which hinders its adoption. To address this lack of noise robustness, this paper applies sparse coding to each stage of speaker recognition, including voice activity detection, speech feature extraction, and speaker modeling, and proposes a systematic solution for improving the recognition rate in noisy environments. The work covers the following aspects. First, the ability of two sparse coding methods to model noise is analyzed theoretically, which lays the foundation for applying sparse coding.
Sparse coding can model noise in two ways. The first models the noise as the residual; the theoretical noise model is Gaussian white noise, and the underlying assumption is that speech is sparse in the speech dictionary while noise is not. White noise is not sparse in any dictionary and therefore satisfies this assumption. The second models the noise with a noise dictionary; the underlying assumption is that speech and noise are each sparse in their own dictionary, and sparser there than in the other's dictionary. The upper and lower bounds of the reconstruction error of the two methods are analyzed theoretically, and the theoretical results are verified by experiments. The analysis shows that when the noise is not sparse, the reconstruction errors of the two methods have the same lower bound but different upper bounds. When the noise may itself be sparse, the second method, which adds a dictionary for noise modeling and so incorporates more prior knowledge, has a lower upper bound on reconstruction error than the first. Next, because voice activity detection is easily affected by noise, a noise dictionary is constructed with sparse coding and a noise-robust voice activity detection method is proposed. Voice activity detection is the first step of speaker recognition; it reduces the amount of data the algorithm must process and improves recognition efficiency. Although current voice activity detection methods also take noise into account, they handle only the case where the noise environment is known and constant. When the noise environment changes or the noise is non-stationary, their performance drops sharply. To address this, the paper proposes a detection method that perceives the noise environment.
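For reference, the two noise-modeling strategies compared in this paragraph can be written in common sparse-coding notation. This is only a sketch under assumed notation: the symbols $y$ (noisy frame), $D_s$ and $D_n$ (speech and noise dictionaries), $x$ (coefficients), and $\lambda$ (sparsity weight) are illustrative choices, and the $\ell_1$-penalized form is one standard formulation, not necessarily the one used in the thesis.

```latex
% Method 1: noise modeled as the residual e (assumed Gaussian white noise)
y = D_s x + e, \qquad
\hat{x} = \arg\min_{x} \tfrac{1}{2}\,\lVert y - D_s x \rVert_2^2 + \lambda \lVert x \rVert_1, \qquad
\hat{s} = D_s \hat{x}

% Method 2: noise modeled by its own dictionary D_n
y = \begin{bmatrix} D_s & D_n \end{bmatrix}
    \begin{bmatrix} x_s \\ x_n \end{bmatrix} + e, \qquad
(\hat{x}_s, \hat{x}_n) = \arg\min_{x_s,\,x_n}
    \tfrac{1}{2}\,\lVert y - D_s x_s - D_n x_n \rVert_2^2
    + \lambda \left( \lVert x_s \rVert_1 + \lVert x_n \rVert_1 \right), \qquad
\hat{s} = D_s \hat{x}_s
```

In both cases the speech estimate $\hat{s}$ uses only the speech dictionary. The methods differ in where the noise ends up: in the residual $e$ for the first, and in the term $D_n \hat{x}_n$ for the second.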
The proposed voice activity detection method proceeds as follows. First, a Gaussian mixture model identifies the noise type; then the trained noise dictionary and the speech dictionary are concatenated into one large dictionary. Finally, the input speech is sparsely represented over the concatenated dictionary, and speech is distinguished from non-speech using the sparse representation on the speech dictionary. Because the method perceives the noise environment, it can select the dictionary matched to the noise and achieve better detection in complex noise environments. Next, two noise-insensitive feature extraction methods are proposed. Feature extraction is a key stage of speaker recognition: on the one hand, the features must be discriminative; on the other hand, they should be affected by noise as little as possible. The first feature uses a perceptual minimum variance distortionless response (MVDR) technique and a shifted delta cepstrum algorithm, which effectively integrates long-term information from the speaker's speech. The extracted feature performs well not only in clean conditions but also when the speech is mixed with noise. Experimental results on the YOHO database and the ROSSI database show that the feature effectively improves the robustness of the recognition system under noise and channel distortion. The second feature decomposes the noisy speech over the speech dictionary, reconstructs the speech from the sparse representation, and then extracts Mel cepstral features for model training and recognition. Because sparse coding models the noise with either the residual or a noise dictionary, the reconstructed signal leaves out the noise component, so the extracted speech features are less affected by noise. Finally, a speaker recognition framework based on two-stage sparse decomposition is proposed.
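The detection step described at the start of this paragraph can be illustrated with a minimal sketch. It makes several assumptions: the noise type is taken as already identified (the Gaussian mixture model step is omitted), plain matching pursuit stands in for whatever sparse solver the thesis uses, and the energy-ratio decision rule, the `threshold` value, and the toy dictionaries are all hypothetical.

```python
def matching_pursuit(signal, atoms, n_iter=5):
    """Greedy sparse coding over unit-norm atoms; returns one coefficient per atom."""
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        # Correlate the current residual with every atom.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        if abs(corr[k]) < 1e-12:  # nothing left to explain
            break
        coeffs[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs


def is_speech(frame, speech_atoms, noise_atoms, threshold=0.5):
    """Assumed rule: a frame is speech when the speech sub-dictionary
    carries more than `threshold` of the total coefficient energy."""
    # Concatenate the speech and noise dictionaries into one large dictionary.
    coeffs = matching_pursuit(frame, speech_atoms + noise_atoms)
    n_speech = len(speech_atoms)
    speech_energy = sum(c * c for c in coeffs[:n_speech])
    total_energy = sum(c * c for c in coeffs)
    if total_energy == 0.0:
        return False
    return speech_energy / total_energy > threshold


# Toy 4-dimensional dictionaries (hypothetical, for illustration only).
SPEECH_ATOMS = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
NOISE_ATOMS = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

print(is_speech([2.0, 1.0, 0.1, 0.0], SPEECH_ATOMS, NOISE_ATOMS))  # True
print(is_speech([0.0, 0.1, 1.5, 1.0], SPEECH_ATOMS, NOISE_ATOMS))  # False
```

In a real system the atoms would be learned from speech and noise recordings, and the noise dictionary would be the one matched to the identified noise type.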
Sparse-representation speaker recognition methods generally concatenate all speakers' dictionaries into one large dictionary. Although this gives some discriminative power, it has two problems. On the one hand, the large dictionary is so large that recognition efficiency drops; on the other hand, there are too many competing classes, which dilutes the competitiveness of the true speaker. In the first stage of the proposed framework, the speech is decomposed over each speaker's dictionary separately, the reconstruction residual is computed for each, and after sorting the residuals a subset of dictionaries containing the true speaker is selected. The second stage concatenates the selected dictionaries into a large dictionary, decomposes the speech to be recognized over it again, and identifies the speaker from scores computed on the sparse representation over each speaker's dictionary. This structure removes a large number of unrelated speakers' dictionaries in the first stage and so reduces the time complexity of the algorithm. The second stage uses a region segmentation method to ensure the recognition rate. Experiments show that the proposed two-stage sparse decomposition method improves both recognition speed and recognition accuracy.
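The two-stage procedure can be sketched as follows. This is an illustrative reading of the description, not the thesis's implementation: matching pursuit is an assumed solver, the per-speaker coefficient energy is an assumed score, the shortlist size `keep` and the toy dictionaries are hypothetical, and the region segmentation step of the second stage is not modeled.

```python
import math


def matching_pursuit(signal, atoms, n_iter=3):
    """Greedy sparse coding over unit-norm atoms; returns (coefficients, residual)."""
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        if abs(corr[k]) < 1e-12:  # residual is orthogonal to every atom
            break
        coeffs[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs, residual


def identify_speaker(signal, dictionaries, keep=2, n_iter=3):
    """Return (best_speaker, shortlist) using a two-stage sparse decomposition."""
    # Stage 1: decompose over each speaker's dictionary and rank by residual norm.
    residual_norms = {}
    for name, atoms in dictionaries.items():
        _, res = matching_pursuit(signal, atoms, n_iter)
        residual_norms[name] = math.sqrt(sum(r * r for r in res))
    shortlist = sorted(residual_norms, key=residual_norms.get)[:keep]

    # Stage 2: concatenate the shortlisted dictionaries and decompose again.
    big, owner = [], []
    for name in shortlist:
        big.extend(dictionaries[name])
        owner.extend([name] * len(dictionaries[name]))
    coeffs, _ = matching_pursuit(signal, big, n_iter)

    # Assumed score: coefficient energy falling on each speaker's atoms.
    scores = {name: 0.0 for name in shortlist}
    for c, name in zip(coeffs, owner):
        scores[name] += c * c
    return max(scores, key=scores.get), shortlist


S = 1.0 / math.sqrt(2.0)
# Toy unit-norm dictionaries for three hypothetical speakers.
DICTS = {
    "A": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    "B": [[0.0, 0.0, 1.0, 0.0], [S, 0.0, 0.0, S]],
    "C": [[0.0, 0.0, 0.0, 1.0], [0.0, S, S, 0.0]],
}

print(identify_speaker([1.0, 0.8, 0.1, 0.2], DICTS))  # ('A', ['A', 'B'])
```

With `keep` smaller than the number of enrolled speakers, the second decomposition runs over a smaller dictionary than the full concatenation, which is where the description locates the speed gain.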






