Answer Selection for Non-Factoid Questions
Published: 2018-08-07 13:50
[Abstract]: With the rise of community question answering websites, more and more user-generated data has accumulated. This data is not only massive and diverse, but also of high quality and high reuse value. To manage and exploit it efficiently, researchers have carried out extensive research and practice on it in recent years, and community question answering is one of the most widely studied topics.
Community question answering (CQA) research builds on data from question answering communities and differs markedly from traditional question answering systems. Traditional systems mainly handle factoid questions whose answers are phrases or named entities, and their main modules are question understanding and answer extraction. CQA has no such restriction and is particularly well suited to non-factoid questions that ask for advice or opinions. CQA research covers question retrieval and recommendation, question interestingness, question and answer quality, answer ranking, user authority, and other directions. Among these, question retrieval and answer selection, as the core modules of CQA, have attracted wide attention from both academia and industry.
The main work of this thesis is to build a community question answering system on large-scale question answering community data, and to study in depth the question analysis, question retrieval, and answer selection techniques involved.
To build the system, this thesis collected a large-scale dataset of more than 130 million questions and 1 billion answers from community websites such as Yahoo! Answers. Compared with earlier work on question answering communities, which relied on data on the order of millions of items, this is a clear departure and gives the system high practical value. On top of this data, the thesis applies automatic query classification to improve the efficiency and effectiveness of each query.
For question retrieval, this thesis proposes using the structural and semantic information of the query question and the candidate questions, and combining features of several different categories through a learning-to-rank algorithm. A ranking model learned from training data improves retrieval relevance and mitigates the word mismatch problem. Experiments show that the ranking model trained with Ranking SVM achieves significant improvements over previous methods in accuracy and other evaluation metrics on different datasets.
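The pairwise idea behind Ranking SVM can be illustrated with a minimal sketch. It is not the thesis's implementation: the two features (a lexical overlap score and a semantic similarity score), the toy data, and all function names and hyperparameters below are assumptions made for illustration. Each pair of candidates with different relevance for the same query becomes a difference vector, and a linear model is trained with a hinge loss so that the more relevant candidate scores higher.

```python
import random


def pairwise_differences(groups):
    """Build difference vectors x_better - x_worse within each query group."""
    diffs = []
    for candidates in groups:
        for xi, yi in candidates:
            for xj, yj in candidates:
                if yi > yj:  # xi is more relevant than xj for this query
                    diffs.append([a - b for a, b in zip(xi, xj)])
    return diffs


def train_ranking_svm(groups, reg=0.01, lr=0.1, epochs=200, seed=0):
    """Linear pairwise ranker: L2-regularized hinge loss, subgradient descent."""
    diffs = pairwise_differences(groups)
    dim = len(diffs[0])
    w = [0.0] * dim
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(diffs)
        for d in diffs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            for k in range(dim):
                grad = reg * w[k]
                if margin < 1.0:  # pair is misordered or inside the margin
                    grad -= d[k]
                w[k] -= lr * grad
    return w


def score(w, x):
    """Ranking score of one candidate; higher means ranked earlier."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Hypothetical toy data: one list per query, items are (features, relevance).
# Features are [lexical_overlap, semantic_similarity], both made up here.
toy_groups = [
    [([0.9, 0.8], 1), ([0.2, 0.1], 0), ([0.5, 0.3], 0)],
    [([0.3, 0.9], 1), ([0.4, 0.2], 0)],
]
w = train_ranking_svm(toy_groups)
ranked = sorted(toy_groups[0], key=lambda item: score(w, item[0]), reverse=True)
```

In this sketch, heterogeneous features are fused simply by learning one weight per feature; candidates for a new query are then sorted by `score`.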
After question retrieval has found candidate questions that are semantically similar to the query, this thesis further proposes a new unsupervised learning method, based on the content of question-answer pairs, that judges answer quality in order to filter out low-quality answers. It makes three assumptions about data in question answering communities: 1) most answers to a question are normal, and only a few are low-quality answers that need to be filtered out; 2) low-quality answers can be detected by comparing them with the other answers to the same question; 3) different answers should be judged by different quality criteria. Based on these assumptions, the method uses content-based features and detects low-quality answers by minimizing the variance of the answer feature vectors while retaining as many answers as possible. Experiments show a clear improvement over the baseline method in terms of ROC.
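The abstract does not give the exact objective, so the following is only a rough sketch of the trade-off it describes, under stated assumptions: answers to one question are represented by hypothetical content feature vectors, and a greedy procedure drops an answer only when doing so lowers the total variance of the remaining vectors by more than a fixed per-answer penalty. The greedy strategy, the `penalty` and `min_keep` parameters, and the toy features are all illustrative choices, not the thesis's method.

```python
def total_variance(vectors):
    """Sum over feature dimensions of the population variance."""
    n = len(vectors)
    if n < 2:
        return 0.0
    dim = len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return sum((v[k] - means[k]) ** 2 for v in vectors for k in range(dim)) / n


def filter_low_quality(vectors, penalty=0.05, min_keep=2):
    """Greedily remove answers while each removal cuts variance by > penalty.

    Returns (kept_indices, removed_indices). The penalty plays the role of
    "keep as many answers as possible": small variance gains are not worth
    discarding an answer.
    """
    kept = list(range(len(vectors)))
    removed = []
    while len(kept) > min_keep:
        current = total_variance([vectors[i] for i in kept])
        best_gain, best_idx = 0.0, None
        for i in kept:
            rest = [vectors[j] for j in kept if j != i]
            gain = current - total_variance(rest)
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None or best_gain <= penalty:
            break  # no single removal is worth losing another answer
        kept.remove(best_idx)
        removed.append(best_idx)
    return kept, removed


# Hypothetical features per answer: [normalized_length, overlap_with_question].
answers = [
    [0.80, 0.70],
    [0.75, 0.65],
    [0.85, 0.72],
    [0.78, 0.68],
    [0.05, 0.00],  # very short, off-topic answer
]
kept, removed = filter_low_quality(answers)
```

Because the comparison is made only among answers to the same question, the sketch reflects the second assumption above: an answer is flagged relative to its siblings rather than against a global threshold.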
After low-quality answers are filtered out, this thesis uses the text of the question-answer pairs together with the authority of the answerers on the community website, and trains an answer ranking model with Ranking SVM on the best answers chosen by community users, in order to re-rank the answers and select the best one. Through these steps, the thesis builds an efficient and practical community question answering system. In a test on 300 high-frequency questions taken from the query logs of a commercial search engine, the system gave a correct answer for 78.0% of the questions and returned results for any question within 2 seconds, which demonstrates its effectiveness and practicality.
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year of degree conferral]: 2013
[Classification number]: TP391.1
Article ID: 2170221
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2170221.html