Research on a Chinese FAQ Question Answering System Based on Phrase-Structure Syntactic Chunks

Published: 2018-03-02 06:21

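Keywords: Chinese question answering system; restricted domain; question classification; chunk; edit distance; question similarity. Source: Kunming University of Science and Technology, master's thesis, 2013. Document type: degree thesis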

[Abstract]: Question answering (QA) is an important research direction in natural language processing. Its goal is to let users ask questions directly in natural language and receive answers. Compared with traditional keyword-based search engines, automatic QA systems offer significant advantages. In a restricted domain, an FAQ-based (frequently asked questions) QA system organizes the questions users ask most often together with their answers, so answers can be located more accurately, quickly and efficiently. Such systems have important application prospects in many areas of daily life and are an active research topic. This thesis applies natural language processing techniques to study key QA technologies for a restricted domain, namely Chinese question classification, question chunk analysis and question similarity computation, and on that basis implements a prototype FAQ QA system for the Yunnan tourism domain. The main contributions are as follows:

(1) Traditional probabilistic and statistical approaches to question classification train the classifier only on the frequency of feature words in questions and ignore the semantic relations between words. To address this, the thesis proposes a question classification method that combines semantic similarity with a hidden Markov sequence analysis model. The method first extracts the feature word sets of all question categories as the observation sequences of the different hidden Markov model classifiers. It then takes the formation and evolution process of each category's feature word set as the state transition sequence. Finally, it uses word semantic similarity to compute the observation probability distribution of the feature words under the states of each category, and builds a separate hidden Markov classification model for each question type. Classification experiments on tourism-domain questions show that the proposed method improves accuracy to some extent over existing methods.

(2) Existing chunk analysis methods rely mainly on the surface forms of words and on statistical features, and do not consider the syntactic structure of different question types. To address this, the thesis proposes a chunk analysis method for Chinese questions based on phrase-structure syntax trees. Starting from the question category already obtained, the method combines the way the question is asked with its lexical features to analyze its sentence pattern, and summarizes the structural forms of different questions. It then uses a phrase-structure parser to generate the syntax tree of the question. Finally, drawing on the characteristics of domain questions, it defines custom chunking rules to identify and label the chunks. Experimental results show that the method performs well.

(3) Existing similarity methods for Chinese sentences do not make full use of lexical semantic information and sentence structure information. To address this, the thesis proposes a domain question similarity method based on an improved edit distance. The method uses chunks instead of characters as the basic edit unit, assigns different weights to different words according to the characteristics of domain questions, measures the substitution cost between chunks by computing the similarity of the words inside them with HowNet, and assigns different insertion and deletion costs to different chunk types. Experimental results show that the method performs well.

(4) Building on the above results and taking the Yunnan tourism domain as an example, the thesis classifies, chunks and labels domain questions, and designs and implements a prototype Yunnan tourism FAQ QA system.
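The chunk-level edit distance described in contribution (3) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the `Chunk` type, the tag set (`NP`, `VP`, `QP`), the numeric insertion and deletion costs, the way chunk similarity is averaged, and the substitution cost formula. `toy_word_sim` is a hypothetical exact-match stand-in for a HowNet-based word similarity, which this sketch does not implement.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

WordSim = Callable[[str, str], float]


@dataclass(frozen=True)
class Chunk:
    label: str               # chunk type (hypothetical tag set)
    words: Tuple[str, ...]   # segmented words inside the chunk


# Assumed per-type insertion/deletion costs; the thesis values are not given.
INDEL_COST: Dict[str, float] = {"NP": 1.0, "VP": 0.8, "QP": 0.5}
DEFAULT_INDEL_COST = 0.6


def indel_cost(chunk: Chunk) -> float:
    return INDEL_COST.get(chunk.label, DEFAULT_INDEL_COST)


def toy_word_sim(a: str, b: str) -> float:
    """Stand-in for a HowNet word similarity: 1.0 on exact match, else 0.0."""
    return 1.0 if a == b else 0.0


def chunk_similarity(c1: Chunk, c2: Chunk, word_sim: WordSim,
                     weight: Callable[[str], float] = lambda w: 1.0) -> float:
    """Weighted best-match similarity between two chunks, averaged both ways."""
    def directed(src: Sequence[str], dst: Sequence[str]) -> float:
        total = sum(weight(w) for w in src)
        if total == 0 or not dst:
            return 0.0
        best = sum(weight(w) * max(word_sim(w, v) for v in dst) for w in src)
        return best / total
    return 0.5 * (directed(c1.words, c2.words) + directed(c2.words, c1.words))


def substitution_cost(c1: Chunk, c2: Chunk, word_sim: WordSim) -> float:
    # Assumed form: dissimilarity scaled by the larger indel cost, so a
    # substitution never costs more than a deletion plus an insertion.
    scale = max(indel_cost(c1), indel_cost(c2))
    return (1.0 - chunk_similarity(c1, c2, word_sim)) * scale


def chunk_edit_distance(a: Sequence[Chunk], b: Sequence[Chunk],
                        word_sim: WordSim = toy_word_sim) -> float:
    """Standard edit-distance dynamic program with chunks as edit units."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(a[i - 1]),      # delete a chunk
                d[i][j - 1] + indel_cost(b[j - 1]),      # insert a chunk
                d[i - 1][j - 1]
                + substitution_cost(a[i - 1], b[j - 1], word_sim),
            )
    return d[m][n]


def question_similarity(a: Sequence[Chunk], b: Sequence[Chunk],
                        word_sim: WordSim = toy_word_sim) -> float:
    """Map the distance into [0, 1] by dividing by the delete-all/insert-all cost."""
    worst = sum(indel_cost(c) for c in a) + sum(indel_cost(c) for c in b)
    if worst == 0:
        return 1.0
    return 1.0 - chunk_edit_distance(a, b, word_sim) / worst
```

In an FAQ setting, `question_similarity` would be evaluated between the chunked user question and each chunked stored question, and the answer attached to the highest-scoring stored question would be returned. Replacing `toy_word_sim` with a real semantic similarity is what lets near-synonymous chunks substitute cheaply.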
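[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1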
