Research on the Application of Wikipedia in an IR4QA System

Published: 2018-06-29 00:07

Topic: question answering systems + IR4QA; source: Wuhan University of Science and Technology, 2012 master's thesis

[Abstract]: A question answering (QA) system is a new generation of intelligent search engine: it lets users ask questions in natural language and returns precise answers. Compared with a traditional search engine, it therefore meets users' query needs better and retrieves the answers they need more accurately. Building on work done for NTCIR8, this thesis studies question understanding and information retrieval, the two main parts of a Chinese QA system that together make up the IR4QA stage, and implements a complete IR4QA system. Question understanding is a research topic shared by every system with a natural-language interface. It is the first stage a QA system executes, and its output strongly affects every later stage. Information retrieval sits in the middle of the pipeline, and its output largely determines the quality of the results produced by the downstream modules. By comparing and analyzing the problems that currently affect these two stages in typical QA systems, the thesis identifies more effective methods and applies them in the system. Building on earlier research, it makes the following contributions:

(1) It reviews and analyzes the state of research on automatic QA systems and search engine technology in China and abroad. Combining the strengths of both kinds of system, and targeting the problems users currently face with search engines (cluttered results, long search times, and low accuracy), it proposes applying Wikipedia to automatic question answering, and designs and implements a Wikipedia-based IR4QA system.
(2) Starting from an analysis of the results the system should ultimately achieve, a set of practical methods was fixed early in the design. On that basis, and following a layered, modular approach, the design principles were established and the system was divided into five modules: index generation, question analysis, query expansion, document retrieval, and document re-ranking.
(3) It studies the key technologies involved in the system, builds up the theoretical and technical groundwork for the difficulties encountered during implementation, and proposes practical solutions to them.
(4) For question classification, given the characteristics of the questions in the question set and the heavy workload of Chinese syntactic and semantic analysis, and with the aim of improving system quality, the system does not use the machine-learning classifiers common in English QA systems. Instead it uses heuristic rules that work by identifying the interrogative words in a question. For the syntactically simple questions in the question set, this gives good recognition results.
(5) To handle vocabulary mismatch between the question and the documents being searched, it adopts a Wikipedia-based query expansion method consisting of three steps: finding the wiki page, locating the relevant paragraphs, and selecting the expansion terms. Comparative experiments show that this method effectively improves the quality of the retrieval results.
(6) To further improve the accuracy of the retrieval results, the document re-ranking module uses the BM25 algorithm to re-rank the retrieved documents, and the re-ranked list is the final retrieval result.
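As an illustration of the re-ranking step in item (6), the following is a minimal sketch of Okapi BM25 scoring in Python. It is not the thesis's implementation: the abstract gives no parameters or code, so the function names `bm25_scores` and `rerank`, the values `k1=1.2` and `b=0.75`, and the smoothed IDF variant (with `+ 1.0` inside the logarithm to keep it non-negative) are all assumptions. Documents and queries are assumed to be already segmented into word lists, which for Chinese text would require a separate word segmenter.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per tokenized document.

    k1 and b are common default values, assumed here; the source
    abstract does not state which values the system used.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant (assumption) that never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(query_terms, docs):
    """Return document indices ordered from highest to lowest BM25 score."""
    scores = bm25_scores(query_terms, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical, pre-tokenized retrieval results.
    retrieved = [
        ["wikipedia", "query", "expansion"],
        ["bm25", "ranking", "bm25", "documents"],
        ["question", "answering"],
    ]
    print(rerank(["bm25", "ranking"], retrieved))
```

In a pipeline like the one described, the query passed to `rerank` would plausibly be the original question terms plus the expansion terms selected from Wikipedia in item (5), though the abstract does not say which query the re-ranking module scores against.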
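[Degree-granting institution]: Wuhan University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3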

[References]