A Spectrogram-Based Method for Speaker-Dependent Speech Recognition of Two-Character Chinese Words

Published: 2018-02-01 21:30

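Keywords: speech recognition; spectrogram; feature fusion; support vector machine (SVM)　Source: Northeast Normal University, 2017 master's thesis　Type: degree thesis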

[Abstract]: Ever since the computer was invented, people have dreamed of making it understand human language. With the rapid development of electronic products, users are increasingly eager to break free of the keyboard and switch to speech input, which is more convenient and user-friendly. Entering Chinese characters in particular has long been a major obstacle to the spread of computer applications, so spoken interaction in Chinese is an important research topic. The list of common Modern Chinese words contains 56,008 high-frequency words: 162 with five or more syllables, 5,855 with four syllables, 6,459 with three syllables, 40,351 with two syllables, and 3,181 with one syllable. Two-syllable words therefore make up 72% of the list (40,351 of 56,008) and play a central role in everyday vocabulary. For this reason, the thesis studies speech recognition algorithms on 10 two-character Chinese words, a choice that is strongly representative. Traditional speech analysis uses a fixed-window Fourier transform to obtain time-frequency localized information from the speech signal. Because it segments the signal into short-time frames as the basic processing unit, it breaks up the information that each syllable carries as a whole, which degrades recognition to some extent. This thesis instead applies image processing techniques to speech recognition. It analyzes and extracts features from the spectrograms of two-character Chinese words using four methods: row projection over equal-width bands, column projection, row projection over dyadic-width bands, and a six-level wavelet packet decomposition of both the wideband and narrowband spectrograms using the two-dimensional discrete db4 wavelet basis, from which the horizontal, vertical, and diagonal detail energies at each level are computed. The features extracted by the four methods are combined into a single recognition feature vector, and a support vector machine (SVM) serves as the classifier for the two-character words. Because the algorithm recognizes speech word by word from the spectrogram as a whole, it brings out the overall time-frequency characteristics of the signal. In keeping with the characteristics of Chinese, each voice command is treated as a single image, which preserves the integrity of the utterance and helps improve the recognition rate and robustness of the speech recognition system. Image processing techniques were also applied to denoise the speech samples. Although results on the denoised speech files were far worse than on noise-free samples, the attempt was carried out systematically and provides a basis and leads for further work on speech enhancement.
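The three projection-based features named in the abstract can be illustrated with a minimal sketch. The abstract does not define the projections precisely, so everything below is an assumption made for illustration: the spectrogram is taken to be a 2-D array with rows as frequency bins and columns as time frames, a "band row projection" is taken to mean the row sums accumulated inside each frequency band, and dyadic bands are taken to have octave-style edges at 0, n/2^(K-1), ..., n/4, n/2, n. All function names are hypothetical, and the wavelet packet features and the SVM classifier are not covered.

```python
from typing import List, Sequence

# Assumed layout: rows = frequency bins, columns = time frames.
Matrix = Sequence[Sequence[float]]


def row_projection(spec: Matrix) -> List[float]:
    """Sum each row (frequency bin) over all time frames."""
    return [sum(row) for row in spec]


def column_projection(spec: Matrix) -> List[float]:
    """Sum each column (time frame) over all frequency bins."""
    return [sum(col) for col in zip(*spec)]


def band_sums(values: Sequence[float], edges: Sequence[int]) -> List[float]:
    """Sum the values inside each half-open band [edges[i], edges[i+1])."""
    return [sum(values[lo:hi]) for lo, hi in zip(edges, edges[1:])]


def equal_width_edges(n: int, bands: int) -> List[int]:
    """Edges that split n rows into `bands` nearly equal parts."""
    return [i * n // bands for i in range(bands + 1)]


def dyadic_edges(n: int, bands: int) -> List[int]:
    """Octave-style edges: 0, n/2^(bands-1), ..., n/4, n/2, n.

    Assumes n >= 2^(bands-1) so that no band is empty.
    """
    return [0] + [n >> k for k in range(bands - 1, -1, -1)]


def projection_features(spec: Matrix, bands: int = 4) -> List[float]:
    """Concatenate the three projection features into one vector."""
    rows = row_projection(spec)
    n = len(rows)
    return (
        band_sums(rows, equal_width_edges(n, bands))   # equal-width bands
        + band_sums(rows, dyadic_edges(n, bands))      # dyadic-width bands
        + column_projection(spec)                      # per-frame energy
    )


if __name__ == "__main__":
    # Toy 8-bin x 3-frame "spectrogram" used only to show the call.
    toy = [[1, 2, 3] for _ in range(8)]
    print(projection_features(toy, bands=4))
```

In this sketch the column projection grows with the number of time frames, so utterances of different lengths would need to be resized or resampled to a common width before the vectors are passed to a classifier; how the thesis handles this is not stated in the abstract.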
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TN912.34
