Research and Implementation of Answer Source Acquisition Methods for Open-Domain Question Answering Systems

Published: 2018-07-02 20:47

Topic: automatic question answering system + answer source acquisition; Source: Taiyuan University of Technology, master's thesis, 2012

[Abstract]: The Internet today holds a wide variety of rich knowledge resources, which makes it much easier to seek help and obtain information when we face problems in daily study and work. Search engines such as Google and Baidu are currently the main way people obtain information from the Web. As users demand more precise information in less time, however, these traditional search engines show some drawbacks. For example, they analyze the user's query as a combination of keywords, which can distort the user's search intent, and they return a large set of web pages that the user must sift through, rather than the accurate, concise answer the user wants. Building on traditional search engines, a new generation of automatic question answering systems has become a research hotspot and trend in information retrieval because such systems are efficient and practical. On the one hand, they let users ask questions in natural language; on the other hand, they return the final answer to the user. They therefore have high theoretical research value and broad application prospects. An automatic question answering system generally consists of three main modules: question analysis, information retrieval, and answer extraction. Answer extraction is the final, key step, and how well it is done determines whether the answer given to the user is accurate and delivered efficiently. This thesis studies methods for acquiring the answer source for this last step. Drawing on previous research, it examines web page crawling, web page deduplication, and web page information extraction. The main work is as follows:
(1) For a question posed by the user, the corresponding answer pages are searched for on the Web, using a traditional search engine as the platform, and the relevant answer pages are saved locally. In the experimental design, we use the Baidu Zhidao (Baidu Knows) knowledge base: a crawler program follows URL links depth-wise and breadth-wise according to the corresponding crawling algorithm and fetches a certain number of pages, which serve as the answer source repository for the subsequent information extraction.
(2) During crawling, the large number of web pages with identical or similar content increases system overhead and reduces efficiency. Drawing on earlier work on web page deduplication, the thesis introduces deduplication methods that are based on text blocks, use shingles, and rely on set statistics, and it gives the evaluation criteria.
(3) When extracting information from web documents, page tags, irrelevant advertisements, images, and similar content can be filtered out. The node structure of the DOM tree is used to represent the page content in structured form, and the text of the web document is extracted from the nodes in preparation for subsequent answer extraction. An experimental scheme is designed and explained.
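As a rough illustration of steps (2) and (3) above, the following Python sketch strips markup from a page and then drops near-duplicate pages by comparing shingle sets. It is not the thesis's implementation. Several choices are assumptions made only for this example: the standard library's event-based `html.parser` stands in for a real DOM tree, shingles are character 4-grams (which avoids needing word segmentation for Chinese text), similarity is the Jaccard coefficient, and the 0.8 threshold and the names `extract_text`, `shingles`, `jaccard`, and `deduplicate` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def extract_text(html):
    """Return the page's text content with tags, scripts, and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def shingles(text, k=4):
    """Return the set of character k-grams of the text, ignoring whitespace."""
    normalized = "".join(text.split())
    if len(normalized) < k:
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8, k=4):
    """Keep the first page of each group of near-duplicates.

    pages is an iterable of (url, html) pairs. A page is dropped when its
    shingle set reaches the similarity threshold with any page already kept.
    """
    kept = []  # (url, shingle set) for pages retained so far
    for url, html in pages:
        current = shingles(extract_text(html), k)
        if all(jaccard(current, seen) < threshold for _, seen in kept):
            kept.append((url, current))
    return [url for url, _ in kept]
```

This sketch compares every new page against every page already kept, so its cost grows quadratically with the number of pages. That is acceptable for a small illustration but would likely need a cheaper fingerprinting scheme for a crawl of realistic size.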
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.092
