Research on Key Technologies of Question Retrieval in Community Question Answering

Published: 2018-06-19 19:16

Topics: community question answering + question retrieval; source: doctoral dissertation, Harbin Institute of Technology, 2014

[Abstract]: With the advent of the Web 2.0 era, community question answering (CQA) has gradually become an essential way for people to obtain knowledge and information online. Unlike Internet search engines, CQA directly returns answers to questions that users pose in natural language, rather than a list of retrieval results that users must sift through themselves. Unlike traditional open-domain question answering systems, the answers in CQA are written by real users, and their quality is higher than that of answers automatically extracted and generated from candidate documents. Moreover, because CQA services have accumulated a large number of question-answer pairs, the core problem and key technology of CQA is retrieving similar, already-answered questions and returning their answers, which we call question retrieval.
However, question retrieval in CQA faces three main challenges: user intent is hard to understand because user questions are verbose; term mismatch between questions arises because users phrase the same question in diverse ways; and retrieval results are ranked by textual relevance alone because the community attributes of questions are not considered. In this dissertation, we address these three key problems from the following four aspects, thereby improving the overall performance of question retrieval in CQA.
Chapter 2 proposes a term-importance weighting method based on a dependency syntactic relation graph, which addresses the verbosity of user question queries in CQA. Specifically, a major problem of existing term-weighting-based question retrieval models is that they ignore the relations between terms when computing term weights. To solve this problem, we propose a new term weighting mechanism that uses the dependency syntactic relations between terms as clues. For a given question, we first construct a dependency graph to compute the association strength of each term pair, and then update the conventional term weights according to the dependency association strength. We verify that the updated term weights can be effectively integrated into existing question retrieval models, and the experimental results show significant improvements over state-of-the-art question retrieval models.
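The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, under stated assumptions: conventional weights are taken to be IDF-like numbers, the dependency edges and their association strengths are hand-written rather than produced by a parser, and the update is a personalized-PageRank-style propagation chosen here for illustration. The function name `update_term_weights` and all values are hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def update_term_weights(base_weights, dep_edges, damping=0.85, iterations=30):
    """Propagate term weights over an undirected dependency graph.

    base_weights: term -> conventional weight (assumed positive, e.g. IDF).
    dep_edges: (head, dependent, strength) triples; strength is an assumed
               positive association score for that term pair.
    Returns term -> updated weight.
    """
    # Build a symmetric, weighted adjacency map from the dependency edges.
    neighbors = defaultdict(dict)
    for head, dep, strength in dep_edges:
        neighbors[head][dep] = strength
        neighbors[dep][head] = strength

    # Normalize the conventional weights so they act as a prior distribution.
    total = sum(base_weights.values())
    prior = {term: w / total for term, w in base_weights.items()}
    scores = dict(prior)

    for _ in range(iterations):
        updated = {}
        for term in prior:
            incoming = 0.0
            for other, strength in neighbors[term].items():
                out_total = sum(neighbors[other].values())
                # Each neighbor passes on a share of its score proportional
                # to the strength of the edge it shares with this term.
                incoming += scores.get(other, 0.0) * strength / out_total
            updated[term] = (1 - damping) * prior[term] + damping * incoming
        scores = updated
    return scores


# Hypothetical question: "how can I fix a flat bicycle tire quickly"
base = {"fix": 1.2, "flat": 2.0, "bicycle": 2.5, "tire": 2.3, "quickly": 0.8}
edges = [
    ("fix", "tire", 1.0),
    ("tire", "flat", 1.0),
    ("tire", "bicycle", 1.0),
    ("fix", "quickly", 0.5),
]
weights = update_term_weights(base, edges)
top_term = max(weights, key=weights.get)
```

In this toy graph the term with the most and strongest dependency links ends up with the largest updated weight, which is the kind of effect a relation-aware weighting scheme is meant to produce. The updated weights could then replace the conventional ones in whatever retrieval model is used.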
Chapter 3 proposes a question reformulation model based on phrase-level paraphrasing, which improves the overall effectiveness of question query expansion. Specifically, term mismatch in question retrieval, caused by the diversity of language expression, has become an urgent problem in CQA. To solve it, we propose a question reformulation mechanism based on phrase-level paraphrasing, thereby improving question retrieval. Given a question query, we first identify the key phrases in the question by combining corpus statistics with features drawn from clues internal to the question. Next, we extract paraphrases of the key phrases by fusing the translation results of multiple online translation engines. Finally, we propose a decoding-based question reformulation method that generates reformulated questions on the basis of fusing the key phrases. Improved question retrieval results on CQA datasets verify the effectiveness of the proposed reformulation algorithm, which significantly outperforms state-of-the-art question retrieval models.
Chapter 4 proposes a topic translation and clustering model to expand the terms in question queries. Specifically, in question retrieval models based on statistical machine translation, relevance ranking depends mainly on the translation probabilities between terms; however, existing machine translation models do not control the translation noise between terms well, which leaves current question retrieval models imperfect. We propose a question retrieval model based on the topic translation and clustering model and show theoretically that it uses topic inference and inter-topic similarity information to control the noise of the translation model, thereby improving question retrieval results. Experimental results show that the proposed model significantly outperforms state-of-the-art question retrieval models on metrics such as MAP, MRR, and P@1.
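For readers unfamiliar with the reported metrics, the sketch below computes MAP, MRR, and precision at rank 1 using their standard definitions. It assumes binary relevance judgments and that average precision is normalized by the number of relevant questions. The question IDs and rankings are made up for illustration and are not data from the dissertation.

```python
def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked question is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """Reciprocal of the rank of the first relevant question (0.0 if none)."""
    for rank, qid in enumerate(ranked, start=1):
        if qid in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision values taken at each rank holding a relevant question."""
    hits, total = 0, 0.0
    for rank, qid in enumerate(ranked, start=1):
        if qid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def evaluate(runs):
    """runs: list of (ranked question IDs, set of relevant question IDs)."""
    n = len(runs)
    return {
        "MAP": sum(average_precision(r, rel) for r, rel in runs) / n,
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "P@1": sum(precision_at_1(r, rel) for r, rel in runs) / n,
    }


# Two hypothetical queries with their ranked results and relevance judgments.
runs = [
    (["q3", "q1", "q7"], {"q1", "q7"}),
    (["q2", "q5", "q9"], {"q2"}),
]
metrics = evaluate(runs)
```

For the first query the relevant questions sit at ranks 2 and 3, so its average precision is (1/2 + 2/3) / 2; the second query has its only relevant question at rank 1.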
Chapter 5 introduces the problem of question popularity prediction and uses it to improve users' question retrieval results. Specifically, as CQA has developed, it has accumulated a large number of high-quality question-answer pairs. These resources not only let users perform question retrieval but, more importantly, allow users to interact with one another. Most research on CQA sites studies question retrieval based on the textual content of questions, and few studies examine how users' personal information and interaction behavior affect retrieval results. In CQA, the popularity of a question reflects users' attention, interests, and interaction behavior; we therefore predict question popularity to improve users' question retrieval experience. We first analyze and model the factors that influence question popularity in order to predict the popularity of new questions. We then use the predicted popularity to re-rank the question retrieval results. Experimental results show that retrieval results re-ranked by popularity are better than those ranked by retrieval relevance alone.
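The abstract does not say how predicted popularity is combined with retrieval relevance, so the sketch below shows just one plausible re-ranking scheme: a linear interpolation of min-max-normalized relevance and popularity scores. The interpolation weight `alpha`, the function names, and the candidate scores are assumptions for illustration; the popularity values stand in for the output of some prediction model that is not shown.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rerank_by_popularity(candidates, alpha=0.5):
    """Re-rank retrieved questions by mixing relevance and popularity.

    candidates: list of (question_id, relevance_score, predicted_popularity).
    alpha: weight on relevance; 1.0 reproduces the relevance-only order.
    """
    if not candidates:
        return []
    relevance = min_max([c[1] for c in candidates])
    popularity = min_max([c[2] for c in candidates])
    scored = [
        (alpha * rel + (1 - alpha) * pop, c[0])
        for c, rel, pop in zip(candidates, relevance, popularity)
    ]
    # sorted() is stable, so ties keep their original retrieval order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [qid for _, qid in scored]


# Hypothetical retrieval output: (question ID, relevance, predicted popularity).
candidates = [("q1", 12.4, 3.0), ("q2", 11.9, 40.0), ("q3", 9.0, 15.0)]
mixed_order = rerank_by_popularity(candidates, alpha=0.5)
relevance_only_order = rerank_by_popularity(candidates, alpha=1.0)
```

With equal weighting, the slightly less relevant but much more popular question moves to the top, while `alpha=1.0` leaves the original relevance ranking unchanged. In practice `alpha` would need to be tuned on held-out data.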
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TP391.3

[References]