Research on Key Technologies of Chinese Question Answering Systems

Published: 2018-08-25 11:46

[Abstract]: A question answering system is a new generation of search engine that combines natural language processing with information retrieval. It has important application prospects, is an important branch of both fields, and has attracted the interest of many researchers. This thesis studies several key technologies involved in building a question answering system: Chinese word segmentation, question classification, question keyword extraction, and construction of the candidate answer set. It reports exploratory results in the following areas:
(1) A dependency skeleton rule base is generated experimentally, and question focus words are extracted with conditional random fields. The question classification module combines the strengths of rule-based and statistical methods: a question of unknown category is classified by trying, in order, rules that map an interrogative word to a category, rules that map an interrogative word plus a focus word to a category, and the dependency skeleton rule base. Questions that the rule bases cannot resolve are decided by a Bayesian model. This approach achieves a classification accuracy of 76% on a small-scale corpus. The results indicate that using interrogative word and part-of-speech triple rules and improving focus word extraction both benefit question classification.
(2) Keywords are extracted with a conditional random field model. The model is trained on a question corpus with annotated keywords and is then used to label the test question set. It achieves relatively high accuracy on a small-scale test corpus of questions.
(3) The formula for scoring candidate sentences is modified. Candidate sentence ranking now takes the position similarity of synonymous keywords into account: candidate sentences are ranked by computing three kinds of sentence structure information between the user question and each candidate sentence, namely synonymous keyword similarity, synonymous keyword position similarity, and sentence length similarity. The experiments show that this calculation works well for factoid question types such as person, location, number, and time.
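The classification cascade described in item (1) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the rule tables `WH_RULES`, `WH_FOCUS_RULES`, and `SKELETON_RULES`, the category labels, and the small add-one-smoothed naive Bayes class are all hypothetical stand-ins, and the questions are assumed to be segmented into tokens already.

```python
import math
from collections import Counter, defaultdict

# Hypothetical rule tables; the abstract does not list the actual rules.
WH_RULES = {"谁": "PERSON", "哪里": "LOCATION"}
WH_FOCUS_RULES = {("哪个", "城市"): "LOCATION", ("多少", "人口"): "NUMBER"}
SKELETON_RULES = {}  # dependency-skeleton rules would go here; empty in this sketch


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over question tokens."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, questions, labels):
        for tokens, label in zip(questions, labels):
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify(tokens, wh_word, focus_word, skeleton, fallback):
    """Try the three rule tables in order, then fall back to the statistical model.

    Returns (category, name of the stage that produced it).
    """
    if wh_word in WH_RULES:
        return WH_RULES[wh_word], "wh-rule"
    if (wh_word, focus_word) in WH_FOCUS_RULES:
        return WH_FOCUS_RULES[(wh_word, focus_word)], "wh+focus-rule"
    if skeleton in SKELETON_RULES:
        return SKELETON_RULES[skeleton], "skeleton-rule"
    return fallback.predict(tokens), "naive-bayes"
```

Returning the stage name alongside the category makes it easy to measure how many questions each rule table resolves before the Bayesian fallback is needed.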
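The candidate sentence ranking in item (3) combines three similarities, but the abstract does not give the modified formula. The sketch below is therefore only one plausible reading: the definitions of each similarity, the weights `(0.6, 0.2, 0.2)`, the synonym table `SYNONYM_GROUPS`, and all function names are assumptions made for illustration.

```python
# Hypothetical synonym groups; a real system would load a thesaurus instead.
SYNONYM_GROUPS = [{"出生", "诞生"}, {"哪里", "何处"}]


def canon(word):
    """Map a word to a fixed representative of its synonym group."""
    for group in SYNONYM_GROUPS:
        if word in group:
            return min(group)
    return word


def keyword_similarity(q_keys, s_tokens):
    """Fraction of question keywords with a synonym-aware match in the sentence."""
    if not q_keys:
        return 0.0
    s_set = {canon(t) for t in s_tokens}
    return sum(canon(k) in s_set for k in q_keys) / len(q_keys)


def position_similarity(q_tokens, q_keys, s_tokens):
    """1 minus the mean gap between relative keyword positions (assumed definition)."""
    s_canon = [canon(t) for t in s_tokens]
    diffs = []
    for key in q_keys:
        if key in q_tokens and canon(key) in s_canon:
            q_pos = q_tokens.index(key) / len(q_tokens)
            s_pos = s_canon.index(canon(key)) / len(s_canon)
            diffs.append(abs(q_pos - s_pos))
    if not diffs:
        return 0.0
    return 1.0 - sum(diffs) / len(diffs)


def length_similarity(q_tokens, s_tokens):
    """Ratio of the shorter token count to the longer one."""
    longest = max(len(q_tokens), len(s_tokens))
    return min(len(q_tokens), len(s_tokens)) / longest if longest else 0.0


def score_sentence(q_tokens, q_keys, s_tokens, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of the three similarities; the weights are illustrative only."""
    w_key, w_pos, w_len = weights
    return (w_key * keyword_similarity(q_keys, s_tokens)
            + w_pos * position_similarity(q_tokens, q_keys, s_tokens)
            + w_len * length_similarity(q_tokens, s_tokens))


def rank_candidates(q_tokens, q_keys, candidates):
    """Sort candidate sentences from highest to lowest score."""
    return sorted(candidates,
                  key=lambda s: score_sentence(q_tokens, q_keys, s),
                  reverse=True)
```

For a question such as `["鲁迅", "出生", "在", "哪里"]` with keywords `["鲁迅", "出生"]`, a candidate containing the synonym `诞生` in a similar position ranks above a sentence of different length with no keyword match.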
[Degree-granting institution]: Ningbo University (宁波大学)
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1

[Citing documents]