
Research on Speech-Driven Virtual Speakers

【Abstract】: Speech-driven virtual-speaker technology generates facial animation for a virtual human from input speech. It not only improves the listener's comprehension of the speech, but also provides a realistic, friendly mode of human-computer interaction; as the technology develops, it promises new interaction experiences that will greatly enrich daily life. This thesis studies speech-driven talking-head animation synthesis with two schemes and compares them. The first is speech-driven articulator movement synthesis based on a deep neural network; the second is speech-driven virtual-speaker animation synthesis based on MPEG-4. Both schemes require a suitable corpus from which audio-visual data matching the research problem is extracted and constructed. In the first scheme, speech production is directly related to the movement of the vocal-tract articulators, such as the position and motion of the lips, tongue, and soft palate. A deep neural network learns the mapping between acoustic feature parameters and articulator position information; from the input speech the system estimates the articulator trajectories and renders them on a three-dimensional virtual human. First, a conventional artificial neural network (ANN) and a deep neural network (DNN) are compared under a range of parameter settings to obtain the better network; next, different context lengths of the acoustic features are evaluated while the number of hidden-layer units is adjusted, yielding the best context length; finally, the optimal network structure is selected, and the articulator trajectories it outputs drive articulator motion synthesis to realize the virtual-human animation. The second scheme, MPEG-4-based speech-driven virtual-speaker animation synthesis, is a data-driven method. First, an audio-visual corpus suited to this work is extracted and constructed from the LIPS2008 database. Then a back-propagation (BP) neural network learns the mapping between acoustic feature parameters and the facial animation parameters (FAP) of the virtual face. Finally, the predicted FAP sequences drive the facial model to synthesize lip animation for the virtual speaker. The animations produced by both schemes are evaluated subjectively and objectively; the results confirm the effectiveness of both, and the synthesized animation is natural and lifelike. Comparing the two schemes, the first requires a matching lip model: its accuracy is higher, but it is less general and its corpus is harder to obtain. The second conforms to the MPEG-4 standard and drives the facial model with FAP sequences, so it is more general and easier to apply widely.
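Both schemes reduce to the same core operation: a regression from context-stacked acoustic feature vectors to motion parameters (articulator positions in scheme 1, MPEG-4 FAP values in scheme 2). The sketch below illustrates that mapping under stated assumptions; it is not the thesis's implementation, and all names, dimensions (39-dim MFCCs, an 11-frame context window, 18 output parameters), and the random toy data are illustrative only. PyTorch is used purely for brevity.

```python
# Minimal sketch (not the thesis implementation): a feed-forward DNN mapping
# context-stacked acoustic features to articulator / FAP trajectories.
# Dimensions, names and the toy data below are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

def stack_context(features: np.ndarray, left: int = 5, right: int = 5) -> np.ndarray:
    """Stack +/- N neighbouring frames so each input carries temporal context,
    mirroring the context-length experiments described in the abstract."""
    T, _ = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].reshape(-1) for t in range(T)])

class SpeechToMotionDNN(nn.Module):
    """Regress articulator positions (scheme 1) or MPEG-4 FAP values (scheme 2)
    from acoustic features; only the output dimension differs."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 512, layers: int = 3):
        super().__init__()
        blocks, d = [], in_dim
        for _ in range(layers):
            blocks += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        blocks.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        return self.net(x)

if __name__ == "__main__":
    # Toy data standing in for MFCC frames and EMA/FAP targets from a real corpus.
    mfcc = np.random.randn(200, 39).astype(np.float32)     # 200 frames, 39-dim MFCC
    targets = np.random.randn(200, 18).astype(np.float32)  # e.g. 18 motion parameters

    x = torch.from_numpy(stack_context(mfcc, left=5, right=5))  # 11-frame window
    y = torch.from_numpy(targets)

    model = SpeechToMotionDNN(in_dim=x.shape[1], out_dim=y.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(20):          # short demonstration training loop
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print("final MSE:", loss.item())
```

Under this framing, the same architecture would simply be trained twice with different targets, once against articulator trajectories and once against FAP sequences, which is how the abstract contrasts the two schemes; the choice of context window length and hidden-layer size corresponds to the tuning experiments it describes.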
【Degree-granting institution】: Southwest Jiaotong University
【Degree level】: Master's
【Year conferred】: 2017
【Classification number】: TN912.3



