A Method for Evaluating Audio-Visual Information Consistency Based on Specific Pronunciation Units and Its Implementation
Published: 2018-11-20 14:37
[Abstract]: China's population is aging rapidly, the disbursement of social security funds faces increasingly serious fraud such as impostor claims, and identity authentication of the rightful beneficiaries has become a pressing problem. Lip-synching at large concerts is frequently reported yet hard to prove, so a way to test suspected lip-synching is needed. Animation is a low-carbon industry encouraged by the state, but animation dubbing still lacks objective quality-evaluation techniques. Because real speech is produced by the human articulators, the speech signal and the lip-movement information are strictly consistent. Starting from audio-visual consistency analysis, this thesis examines the authenticity of the speech samples used in voice-based identity authentication, improving the accuracy of beneficiary authentication for social security funds and effectively preventing impostor claims. It also provides a technical basis for objective dubbing-quality evaluation and for lip-synch detection.

This thesis proposes an audio-visual consistency analysis method based on specific pronunciation units. The core algorithm is co-inertia analysis (CoIA), which correlates the speech with the lip movement in a video and analyzes their consistency. The method comprises a training stage and a test stage. In training, features are extracted from the audio and from the lip images in the video, and the mapping matrices between the two are computed; in testing, the features are projected onto the mapping matrices, and the mean covariance of the projections is the desired correlation coefficient. The larger the CoIA correlation coefficient, the more correlated the audio and video; the smaller the equal error rate (EER), the better the consistency-evaluation performance. Experiments show that CoIA performs well for audio-visual consistency analysis.

The innovations of this thesis are the selection of specific pronunciation units and audio-visual consistency analysis based on them: specific pronunciation units extracted from a sentence stand in for the whole sentence in the analysis. First, the viseme (mouth-shape) features of Chinese initials and finals are analyzed, and the finals are clustered by viseme similarity, grouping finals that share the same mouth-shape parameters into one class and yielding 16 classes in total. Second, the classes with the highest CoIA correlation coefficients are selected as the specific pronunciation units, and consistency analysis on experimentally constructed consistent and inconsistent data verifies that the selection is reasonable. Finally, whole sentences are compared against the specific pronunciation units extracted from them, including a comparison of the syllable-based clustering with whole sentences. The experimental database contains 350 sentences, each about 3 to 10 seconds long. From every sentence, 7 groups of specific pronunciation units, each about 0.3 to 0.8 seconds long, are located and identified using short-time energy, zero-crossing rate, and fundamental frequency; the extracted units are about three quarters shorter in total duration than the whole sentences, reducing the amount of data to process. Applying CoIA to both, the EER of consistency evaluation on the specific pronunciation units is 2.7% lower than on the whole sentences.
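To make the two-stage procedure concrete, here is a minimal NumPy sketch of CoIA as described above: training finds a pair of mapping matrices that maximize the covariance between the projected audio and lip features (via an SVD of their cross-covariance), and testing scores a clip by the mean covariance of the paired projections. The feature choices, the dimensionality k, and all names are illustrative assumptions, not the thesis's exact configuration.

```python
import numpy as np

def coia_train(X, Y, k=10):
    """Fit co-inertia analysis (CoIA): find mapping matrices that
    maximize the covariance between projected audio and lip features.
    X: (n_frames, p) audio features; Y: (n_frames, q) lip features."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / len(Xc)          # cross-covariance matrix, shape (p, q)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], Vt[:k].T        # mapping matrices Wx (p, k), Wy (q, k)

def coia_score(X, Y, Wx, Wy):
    """Consistency score: mean covariance of the paired projections."""
    Xp = (X - X.mean(axis=0)) @ Wx
    Yp = (Y - Y.mean(axis=0)) @ Wy
    covs = np.mean(Xp * Yp, axis=0)  # per-axis covariance of projections
    return covs.mean()
```

In the setup the abstract describes, the mapping matrices are trained on genuine (consistent) recordings; a low score on a test clip then flags a possible audio-visual mismatch.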
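The abstract names short-time energy, zero-crossing rate, and fundamental frequency as the cues for locating specific pronunciation units inside a sentence. Below is a hedged sketch of the first two cues, assuming 16 kHz audio with 25 ms frames and 10 ms hops; the thesis's actual frame settings are not given here.

```python
import numpy as np

def short_time_energy_zcr(signal, frame_len=400, hop=160):
    """Frame-level energy and zero-crossing rate; together with an F0
    track, these can mark voiced, vowel-like regions where candidate
    pronunciation units (0.3-0.8 s here) are likely to lie."""
    energy, zcr = [], []
    for i in range(0, len(signal) - frame_len + 1, hop):
        frame = np.asarray(signal[i:i + frame_len], dtype=float)
        energy.append(np.sum(frame ** 2))
        # a zero crossing is a sign change between adjacent samples
        zcr.append(np.mean(np.abs(np.diff(np.signbit(frame).astype(int)))))
    return np.array(energy), np.array(zcr)
```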
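The reported 2.7% improvement is a drop in equal error rate measured over scores from consistent and inconsistent audio-video pairs. A minimal sketch of how an EER can be computed from two such score sets (a standard threshold sweep, not code from the thesis):

```python
import numpy as np

def equal_error_rate(consistent_scores, inconsistent_scores):
    """Sweep thresholds over all observed scores; the EER is where the
    false-accept rate (inconsistent pairs scoring >= threshold) meets
    the false-reject rate (consistent pairs scoring < threshold)."""
    pos = np.asarray(consistent_scores)
    neg = np.asarray(inconsistent_scores)
    candidates = []
    for t in np.concatenate([pos, neg]):
        far = np.mean(neg >= t)
        frr = np.mean(pos < t)
        candidates.append((abs(far - frr), (far + frr) / 2))
    return min(candidates)[1]  # EER at the closest FAR/FRR crossing
```

Comparing the EER of CoIA scores from whole sentences against that from the extracted units is exactly the evaluation the abstract reports.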
[Degree-granting institution]: South China University of Technology (华南理工大学)
[Degree level]: Master's
[Year conferred]: 2013
[CLC number]: TN912.3
Document ID: 2345176
Link: https://www.wllwen.com/wenyilunwen/dongmansheji/2345176.html