
Research on Question Classification and Answer Extraction for Question Answering Systems

Published: 2018-06-02 13:07

  Topics: question answering systems + question classification; Source: master's thesis, Northeastern University (东北大学), 2013


[Abstract]: With advances in artificial intelligence, information retrieval, and natural language processing, research on question answering (QA) systems has made considerable progress, and the QA evaluation tasks organized by TREC and other conferences have pushed the field further. Compared with English, Chinese has no widely adopted QA evaluation, and the relevant data sets are scarce, so research on Chinese QA systems lags behind. This thesis uses an online search engine for answer retrieval and focuses on question analysis and answer extraction for Chinese QA systems.

For question analysis, the thesis first proposes a stop-word selection method based on word combinations and question categories. Stop words are first extracted from phrases composed of n words, with the question categories taken into account during extraction, and the procedure iterates by progressively decreasing n. The method achieved good results on the thesis's data set.

Next, building on the idea behind TF-IDF, the thesis proposes TFC-ICF, a feature selection method for question classification. TFC-ICF jointly considers how well a word identifies a particular category and how the word is distributed across categories, so that higher-quality classification features can be selected. An SVM-based classifier performs the automatic classification; with feature words selected by TFC-ICF, question classification accuracy reaches 80.45%. To improve classification further, the thesis takes TFC-ICF as the baseline and proposes three additional approaches: manual feature selection, feature selection based on keyword expansion, and feature selection based on syntactic information. For the latter two, several different ways of using the features were tested. Combined with TFC-ICF, the three approaches reach maximum question classification accuracies of 86.01%, 85.14%, and 82.13%, respectively.

For answer extraction, the thesis first discusses how to select candidate answer sentences using sentence similarity computed in a vector space model, and then uses entity recognition to extract from those sentences the entities that match the question category. Finally, it proposes an answer extraction method based on sentence similarity and entity information, which achieves good experimental results on the NTCIR5 CLQA question answering test set.

The thesis concentrates on question classification and answer extraction and obtains some results, but problems remain: the question data set is of relatively poor quality, entity recognition is not yet fully satisfactory, and the final answer extraction performance is not ideal.
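The candidate-sentence selection step mentioned in the abstract (sentence similarity in a vector space model) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `cosine_similarity` and `rank_candidates`, the use of raw term frequencies as vector weights, and the assumption that questions and sentences are already segmented into word lists are all choices made here for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-frequency vectors.

    Assumption: inputs are already tokenized (for Chinese text, word
    segmentation would have to happen before this step).
    """
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms of vec_a; Counter returns 0 for missing terms.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_candidates(question_tokens, sentences, top_k=2):
    """Return the top_k (score, sentence) pairs most similar to the question."""
    scored = [(cosine_similarity(question_tokens, s), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the thesis.
    question = ["capital", "france"]
    retrieved = [
        ["paris", "is", "capital", "france"],
        ["weather", "nice"],
        ["france", "wine"],
    ]
    for score, sentence in rank_candidates(question, retrieved):
        print(round(score, 3), sentence)
```

In a pipeline like the one the abstract describes, the top-ranked sentences would then be passed to an entity recognizer, and only entities whose type matches the predicted question category would be kept as answer candidates. That later step is not sketched here.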
[Degree-granting institution]: Northeastern University (东北大学)
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3



