Research on Key Technologies for Web Community Question Answering Retrieval

Published: 2018-06-17 05:29

Topics: community question answering service + answer summarization; Source: Fudan University, doctoral dissertation, 2014

[Abstract]: Community question answering (CQA) services let people pose questions and obtain answers by interacting with one another in web communities. Because CQA systems hold a large amount of knowledge and experience generated by real users, they have become a popular way of seeking information alongside traditional search engines. In a CQA system, a user can submit a question in natural language and ask other community members directly for answers, or can automatically retrieve questions similar to the one asked and reuse their existing answers. For most non-factoid questions, especially open-ended questions that carry personal context or seek advice, question retrieval is often more effective than the traditional approach of using natural language processing and information retrieval to extract passages from web documents and then extract answers from them. For this reason, retrieval of general questions in web communities has become an important component of next-generation intelligent information retrieval. Sparse learning is a statistical learning approach that has emerged in recent years. Using sparse regularization as its main tool, this dissertation studies a series of key technologies in community question answering. Specifically, it studies answer summarization for complex multi-sentence questions in web communities, automatic hierarchical topic classification of questions, and improvements to question retrieval models. The main work and contributions are as follows:
1. Automatic answer summarization: Complex multi-sentence questions in a community, that is, questions that tend to contain many sub-questions and accompanying context, often suffer from the so-called "incomplete answer" problem: the "best answer" is not comprehensive and omits information, found in other answers, that is useful for answering the question. This dissertation proposes a novel automatic answer summarization method that consolidates the valuable information in all answers to a question. The method uses a conditional random field model to capture local and non-local contextual relations between answer sentences, and penalizes the parameters with group L1 regularization to make full use of each feature.
2. Hierarchical question classification: When a user submits a question to a CQA system, the system asks the user to manually select a category from a hierarchical directory to indicate the question's topic. This makes it easier for the system to recommend the question to domain experts on that topic and for other users to browse and search later. However, labeling a question manually requires a thorough understanding of the whole category hierarchy, so it is time-consuming and hurts the user experience. To remove this burden, this dissertation proposes an automatic kernelized hierarchical topic classification algorithm for questions. It combines multiple kernel learning over the question's features with sparse orthogonal constraints on the parameters, improving the model's ability to discriminate between similar topic categories while reducing the number of model parameters.
3. Question retrieval model: To further improve the usability of existing questions in CQA, this dissertation studies how to use automatic classification results to improve question retrieval. Existing retrieval models usually measure the importance of a query term to the query by its frequency within the query, which does not help when every query term occurs only once. Unlike existing retrieval methods, we use a sparse question classification method to simulate how real users assign hierarchical category labels, and use that process to automatically select the important query terms and obtain their local weights for the query. In addition, we re-rank the initial retrieval results based on the similarity between results, further improving question retrieval performance.
Most methods in this dissertation constrain the model parameters with sparsity-inducing regularization terms. This has several benefits. First, it reduces the number of model parameters: with fewer features, the model needs correspondingly less training data, which prevents overfitting caused by too many parameters and strengthens generalization to new data. Second, it improves model efficiency: with fewer parameters, the space needed to store the model and the computation time both decrease. Third, it helps reveal dependency relations: once the sparse model has removed distracting, irrelevant terms, it can focus on the features that truly help inference. Therefore, beyond their usefulness for CQA retrieval, the sparse methods proposed here may also offer insights for other web applications such as verbose-query retrieval, web document classification, and summarization. A series of experiments on a real CQA data set from Yahoo! Answers shows that the proposed methods clearly improve accuracy compared with both current state-of-the-art methods and several strong baselines.
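The sparsity benefits described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's training procedure, which the abstract does not specify; it only shows the standard block soft-thresholding (proximal) step for a group L1 penalty on a toy weight vector. The function name, the grouping of parameters, and all numeric values are hypothetical.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Proximal step for the group L1 penalty lam * sum_g ||w_g||_2.

    weights: list of floats (toy model parameters, assumed for illustration).
    groups:  list of index lists, one per feature group (hypothetical grouping).
    lam:     non-negative regularization strength.
    """
    shrunk = list(weights)
    for idx in groups:
        norm = math.sqrt(sum(weights[i] ** 2 for i in idx))
        # Shrink the whole group toward zero; a group whose norm is at most
        # lam is set to zero as a unit, which removes all of its parameters.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            shrunk[i] = scale * weights[i]
    return shrunk


# Toy example: one strong feature group and one weak feature group.
w = [3.0, 4.0, 0.3, 0.4]
groups = [[0, 1], [2, 3]]
w_sparse = group_soft_threshold(w, groups, lam=1.0)

# Groups that still have at least one non-zero parameter after shrinkage.
active = [g for g in groups if any(w_sparse[i] != 0.0 for i in g)]
print(w_sparse)
print(active)
```

In this toy run the weak group is zeroed out as a whole while the strong group is only shrunk, which is the mechanism behind the smaller parameter count and storage cost mentioned above.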
[Degree-granting institution]: Fudan University
[Degree level]: Doctorate
[Year degree conferred]: 2014
[Classification number]: TP391.3
