Research on Answer Summarization Algorithms for Non-Factoid Questions in Community Question Answering Systems

Published: 2017-12-27 12:07

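Keywords: Research on Answer Summarization Algorithms for Non-Factoid Questions in Community Question Answering Systems; Source: Shandong University, master's thesis, 2017; Document type: degree thesis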

【Abstract】: In recent years, the number of users of community question answering (CQA) systems has grown rapidly. CQA systems give users a platform for posting questions and finding answers, and the massive collection of question-answer pairs on these platforms has become a new research focus for researchers in China and abroad. Many papers have already addressed various research topics in the CQA setting; this thesis focuses on the task of answer summarization in CQA systems. Whereas most previous work concentrates on factoid questions, our work concentrates on non-factoid questions. In factoid CQA, a question usually seeks one definite answer, and most answers are single sentences. Non-factoid questions, by contrast, tend to seek views, opinions, or advice, so they usually need multiple sentences, or even a whole article, as an answer. Compared with traditional multi-document summarization, which mainly targets news articles, answer summarization in non-factoid CQA faces its own challenges: the shortness and sparsity of answer sentences, and the diversity of answer content.

To address these challenges, we propose a sparse-coding-based answer summarization strategy with three core components: short-text expansion of answer sentences, vector representation of sentences, and a sparse coding optimization framework. Specifically, using entity linking and a question-answer sentence ranking strategy, we expand each answer sentence under a question into a richer representation made up of multiple Wikipedia sentences. On this basis, each sentence is represented as a feature vector by a convolutional neural network model designed for short text. Using these sentence vectors, we then propose a sparse coding optimization framework that jointly considers the candidate answer sentences and the auxiliary Wikipedia sentences to estimate a distinctiveness score for every candidate sentence. Given these scores, we extract the highest-scoring answer sentences with the maximal marginal relevance (MMR) algorithm to produce the final answer summary.

The main contribution of this thesis is that, by handling three problems of non-factoid questions (the shortness of answer sentences, their sparsity, and the diversity of answer content), we tackle answer summarization for non-factoid questions in CQA systems. We also ran experiments on a public benchmark dataset and compared our approach with several state-of-the-art baseline methods to evaluate its performance. The experimental results confirm the effectiveness of the proposed method and show a significant improvement on the ROUGE metrics over the latest methods. Further analysis of the results also indicates that the proposed algorithm has good stability and scalability.
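The final extraction step can be illustrated with a minimal sketch of greedy maximal marginal relevance (MMR) selection. The abstract does not give the thesis's exact scoring or similarity functions, so everything here is an assumption made for illustration: the function names `cosine` and `mmr_select`, the use of cosine similarity as the redundancy measure, the `trade_off` weight, and the toy vectors and scores. In the thesis, the per-sentence scores would come from the sparse coding framework and the vectors from the convolutional neural network model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(vectors, scores, k, trade_off=0.7):
    """Greedily pick up to k sentence indices by maximal marginal relevance.

    vectors   -- one feature vector per candidate sentence (hypothetical input)
    scores    -- one score per candidate sentence (hypothetical input)
    trade_off -- weight on the score versus the redundancy penalty (assumed value)
    """
    selected = []
    remaining = list(range(len(vectors)))
    while remaining and len(selected) < k:

        def marginal(i):
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return trade_off * scores[i] - (1.0 - trade_off) * redundancy

        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: sentences 0 and 1 are near-duplicates, sentence 2 is different.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
toy_scores = [0.9, 0.85, 0.5]
print(mmr_select(toy_vectors, toy_scores, k=2, trade_off=0.5))  # [0, 2]
```

With `trade_off=0.5`, the near-duplicate sentence 1 is skipped in favor of the lower-scoring but dissimilar sentence 2, which is how MMR addresses the diversity of answer content. With `trade_off=1.0` the redundancy penalty vanishes and selection falls back to plain score ranking.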
【Degree-granting institution】: Shandong University
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP391.1

【References】