Research on Spoken Dialogue Systems and Language Model Techniques for Service Robots

Published: 2018-07-23 13:10

[Abstract]: As speech recognition technology has matured, it has found applications in many fields. In service robotics, speech technology is used mainly in the robot's spoken dialogue system. Targeting the concrete application scenarios of the KeJia robot, this thesis studies the design and implementation of a spoken dialogue system for service robots. It also studies a language-modeling technique for speech recognition: a recurrent neural network language model (RNNLM) combined with unsupervised word clustering. The work on the spoken dialogue system covers two aspects: speech recognition and dialogue management. For speech recognition, the thesis first explains the basic principles of speech recognition in some detail, then describes the collection of a corpus for the KeJia robot's applications, and then presents the complete procedure for training the acoustic models the module needs. Several acoustic models are evaluated on the training and test sets built in this work; context-dependent triphone models give the best recognition performance, with a best word recognition rate of 98.39% and a corresponding sentence recognition rate of 90.83%. Because the robot's onboard computer has limited computing power, and because the robot can report its own state while it operates, the thesis designs a mechanism that changes the language model dynamically in order to shrink the search space during decoding. In experiments on the completed speech recognition module, the best sentence recognition rate with the dynamic language model mechanism is 87.95%, 12.05% higher than that of the module without it.
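The abstract reports a word recognition rate and a sentence recognition rate without defining them. A minimal sketch, assuming the common definitions: word accuracy is (N - S - D - I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions and insertions of a minimum-edit alignment (the same formula HTK's HResults reports as "Accuracy"), and sentence accuracy is the fraction of utterances recognized with no word error. The function names and example sentences are hypothetical, not taken from the thesis.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_errors, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, ins)  # match: no new error
            else:
                diag = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = dp[i - 1][j]
            up = (e + 1, s, d + 1, ins)  # deletion
            e, s, d, ins = dp[i][j - 1]
            left = (e + 1, s, d, ins + 1)  # insertion
            dp[i][j] = min(diag, up, left)  # fewest total errors wins
    _, s, d, ins = dp[n][m]
    return s, d, ins


def recognition_rates(pairs):
    """pairs: list of (reference, hypothesis) strings; returns (word_acc, sent_acc)."""
    total_ref = total_err = correct_sentences = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        s, d, i = edit_counts(r, h)
        total_ref += len(r)
        total_err += s + d + i
        correct_sentences += (s + d + i == 0)  # sentence counts only if error-free
    return (total_ref - total_err) / total_ref, correct_sentences / len(pairs)


pairs = [("go to the kitchen", "go to the kitchen"),
         ("bring me a coke", "bring me coke")]
print(recognition_rates(pairs))  # (0.875, 0.5)
```

The example shows why the two rates diverge so much in the reported results: a single dropped word costs only one of eight words but makes the whole sentence count as wrong.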
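The abstract describes the dynamic language model mechanism only at a high level. A minimal sketch of the idea, assuming the robot reports a discrete task state and each state is associated with a small command grammar (all state names, phrases and helper functions below are hypothetical): the recognizer accepts only the sentences valid in the current state, which is a much smaller search space than one static grammar covering every task.

```python
# Hypothetical per-state command grammars; the thesis does not publish its own.
GRAMMARS = {
    "idle": {"start the cocktail party", "follow me"},
    "taking_order": {"i want a coke", "i want a beer", "my name is tom"},
    "navigating": {"go back", "wait here"},
}
# Hypothetical commands that stay valid in every state.
ALWAYS_ACTIVE = {"stop", "yes", "no"}


def active_grammar(state):
    """Sentences the recognizer accepts while the robot is in `state`."""
    return GRAMMARS.get(state, set()) | ALWAYS_ACTIVE


def static_grammar():
    """Baseline without the mechanism: every sentence of every state."""
    return set().union(*GRAMMARS.values()) | ALWAYS_ACTIVE


def constrained_best(nbest, state):
    """Return the highest-scoring hypothesis allowed in `state`, or None.

    Stand-in for decoding: a real system would load the state's grammar into
    the decoder so that out-of-state sentences are never searched at all.
    """
    allowed = active_grammar(state)
    candidates = [(score, hyp) for hyp, score in nbest if hyp in allowed]
    return max(candidates)[1] if candidates else None


# Hypothetical N-best list (hypothesis, log score) for one utterance.
nbest = [("i want a beer", -120.0), ("go back", -125.0)]
print(len(static_grammar()), len(active_grammar("navigating")))  # 10 5
print(constrained_best(nbest, "navigating"))  # go back
```

In the navigating state the acoustically better but out-of-context hypothesis is ruled out; this is the kind of gain a state-dependent language model can bring on a robot with limited onboard compute.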
For dialogue management, the thesis adopts a layered state-machine design suited to the characteristics of service robots and implements the dialogue management framework in Python. It then describes how the framework incorporates multimodal information and its verification-and-confirmation mechanism, and finally presents the application of this dialogue manager to the KeJia robot's cocktailparty task. In addition, the thesis studies in depth the application of unsupervised word clustering to RNNLMs. RNN-based language models have been shown to achieve leading performance, and prior research shows that adding part-of-speech (POS) tags to the input layer of an RNNLM significantly improves the model. However, POS tagging requires a tagger trained on manually annotated data, which costs considerable human effort and resources, and the extra tagger makes the model more complex. To address these problems, this thesis replaces the POS information with the output of Brown word clustering, adding each word's Brown cluster to the RNNLM input layer. Experiments on the Penn Treebank corpus show that the RNNLM with Brown word-class information achieves 8-9% lower perplexity than the original RNNLM.
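A minimal sketch of a layered state machine in Python, the language the thesis uses: a top-level task machine steps through a list of sub-machines, each of which asks for a value and then confirms it with a yes/no question, loosely in the spirit of the verification-and-confirmation mechanism mentioned above. The class names, states, prompts and valid values are hypothetical; the abstract does not give the framework's actual API, and multimodal input is omitted.

```python
class ConfirmSlot:
    """Sub-machine with two states: 'ask' for a value, then 'confirm' it."""

    def __init__(self, prompt, valid_values):
        self.prompt = prompt
        self.valid_values = valid_values
        self.state = "ask"
        self.value = None

    def start(self):
        self.state = "ask"
        return self.prompt

    def step(self, utterance):
        """Consume one recognized utterance; return (reply, finished)."""
        if self.state == "ask":
            if utterance in self.valid_values:
                self.value = utterance
                self.state = "confirm"
                return f"Did you say {utterance}?", False
            return "Sorry, please repeat. " + self.prompt, False
        # In 'confirm': "yes" finishes the sub-task, anything else re-asks.
        if utterance == "yes":
            return "OK.", True
        self.state = "ask"
        return self.prompt, False


class CocktailOrder:
    """Hypothetical top-level machine: collect the guest's name, then the drink."""

    def __init__(self):
        self.slots = [
            ("name", ConfirmSlot("What is your name?", {"tom", "anna"})),
            ("drink", ConfirmSlot("What would you like to drink?", {"coke", "beer"})),
        ]
        self.index = 0  # top-level state: which sub-machine is active
        self.result = {}

    def start(self):
        return self.slots[0][1].start()

    def step(self, utterance):
        if self.index >= len(self.slots):
            return "The order is complete."
        key, slot = self.slots[self.index]
        reply, finished = slot.step(utterance)
        if not finished:
            return reply  # stay inside the current sub-machine
        self.result[key] = slot.value
        self.index += 1  # top-level transition to the next sub-task
        if self.index == len(self.slots):
            return f"I will bring {self.result['drink']} to {self.result['name']}."
        return self.slots[self.index][1].start()


dm = CocktailOrder()
for said in [None, "tom", "yes", "wine", "coke", "yes"]:
    print(dm.start() if said is None else dm.step(said))
```

Splitting ask/confirm into a reusable sub-machine keeps the top-level task logic short; a new task only lists its slots.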
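In a Mikolov-style (Elman) RNNLM, the input at each step is the one-hot vector of the current word concatenated with the previous hidden state. A minimal sketch of the modification the abstract describes, assuming the word's Brown cluster ID is appended to the input as a second one-hot block; the vocabulary, the word-to-cluster map and the layer sizes are hypothetical, and the weights are random and untrained, so the sketch shows the data flow and how perplexity is computed, not the thesis's training setup or results.

```python
import math
import random

# Toy vocabulary and a hypothetical Brown clustering (normally learned from text).
VOCAB = ["<s>", "the", "a", "cat", "dog", "runs", "sleeps"]
BROWN = {"<s>": 0, "the": 1, "a": 1, "cat": 2, "dog": 2, "runs": 3, "sleeps": 3}
N_CLASSES, HIDDEN = 4, 5
V = len(VOCAB)
IDX = {w: i for i, w in enumerate(VOCAB)}

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Input layer = one-hot word (V) + one-hot Brown class (N_CLASSES) + previous hidden.
W_in = rand_matrix(HIDDEN, V + N_CLASSES + HIDDEN)
W_out = rand_matrix(V, HIDDEN)


def rnn_step(word, h_prev):
    """One RNN step; returns (next-word probability distribution, new hidden state)."""
    x = [0.0] * (V + N_CLASSES)
    x[IDX[word]] = 1.0
    x[V + BROWN[word]] = 1.0  # the extra Brown-class feature in the input layer
    x += h_prev  # recurrent connection
    h = [1 / (1 + math.exp(-sum(w * v for w, v in zip(row, x)))) for row in W_in]
    logits = [sum(w * v for w, v in zip(row, h)) for row in W_out]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps], h


def perplexity(sentence):
    """PPL = exp(-(1/N) * sum_t log p(w_t | history)) over the N predicted words."""
    h = [0.0] * HIDDEN
    words = ["<s>"] + sentence.split()
    log_prob = 0.0
    for prev, cur in zip(words, words[1:]):
        probs, h = rnn_step(prev, h)
        log_prob += math.log(probs[IDX[cur]])
    return math.exp(-log_prob / (len(words) - 1))


print(perplexity("the cat runs"))  # close to V = 7, since the weights are untrained
```

Training would fit `W_in` and `W_out` by backpropagation through time. The relative reduction the abstract reports corresponds to (PPL_baseline - PPL_brown) / PPL_baseline, where the baseline model omits the Brown-class block from its input.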
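[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP242; TN912.34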
