Research on Prompt Analysis and Text Generation Techniques for College Entrance Examination (Gaokao) Essays

Published: 2018-03-14 19:44

Topic: text tag recommendation　Focus: deep neural networks　Source: Harbin Institute of Technology, 2017 master's thesis　Document type: degree thesis

[Abstract]: With the rapid development of artificial intelligence in recent years, there is growing interest in testing what level of intelligence machines can reach. To this end, China launched research projects on a "college entrance examination (gaokao) answering robot" in 2015, and automatically answering gaokao essay questions is one of the key research topics. Addressing this topic, we study two aspects in depth: essay prompt analysis and text generation.

Essay prompt analysis takes an essay prompt as input and extracts from it a set of topic words that specify what should be written. For this task, rule matching and keyword extraction can handle about 40% of essay prompts. We treat the analysis of the remaining prompts as a special text tag recommendation task, which is the main focus of the prompt analysis part. Considering the particular nature of the task, we propose a model based on a hierarchical deep neural network. First, a GRU or CNN learns sentence vector representations. Then, taking the sentence vectors as input, a sentence-level GRU produces a document vector, which is fed as features into a logistic regression model that predicts the confidence of each candidate tag word. Experiments show that, given sufficient training data, the hierarchical deep neural network model outperforms the other models, improving F1 by up to 8 percentage points.

Although the hierarchical deep neural network model performs very well on essay prompt analysis, it requires a fairly large training corpus, and collecting large corpora is often time-consuming and laborious. We therefore further propose a method that combines deep neural networks with transfer learning. We first train a deep neural network model on the source domain and then train it again on the target domain with a transfer learning method, using the knowledge learned from the source domain to help learning on the target domain. Experiments on two datasets show that the transfer-learning-based method significantly outperforms supervised learning: F1 improves by up to 7 percentage points on the Douban dataset, and P@3 improves by up to 31.4 percentage points on the essay prompt dataset.

For text generation, we focus on paragraph-level text generation that conforms to multiple topics. We want the model to be controlled by several topic words and to generate a passage that covers the semantics of these topic words. To this end, we propose the Coverage-based LSTM model. In this model, we construct a multi-topic coverage vector, which learns a weight for each topic word and is continually updated during generation. This vector is then fed into the attention network to guide text generation. In addition, we automatically constructed two paragraph-level Chinese essay corpora, containing 305,000 essay paragraphs and 56,621 Zhihu texts. Experiments show that our model achieves better BLEU scores than the other models. Moreover, human evaluation shows that the Coverage-based LSTM model can generate text that is coherent and relevant to the input topic words.
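The last stage of the tag recommendation pipeline (document vector, then a logistic regression confidence per candidate tag, then evaluation with P@3) can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the toy document vector, the tag weights, and the use of the common definition of P@k (the fraction of the top k predicted tags that are correct). In the thesis the document vector would come from the sentence-level GRU, and the weights would be learned rather than written by hand.

```python
import math


def tag_confidences(doc_vector, tag_weights, tag_biases):
    """Score each candidate tag with an independent logistic regression."""
    scores = {}
    for tag, weights in tag_weights.items():
        logit = sum(w * x for w, x in zip(weights, doc_vector)) + tag_biases[tag]
        scores[tag] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> confidence in (0, 1)
    return scores


def top_k(scores, k):
    """Return the k tags with the highest confidence, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]


def precision_at_k(predicted, gold, k):
    """P@k: fraction of the top-k predicted tags that appear in the gold set."""
    hits = sum(1 for tag in predicted[:k] if tag in gold)
    return hits / k


# Hypothetical document vector (stand-in for the sentence-level GRU output).
doc_vector = [0.5, -1.0, 2.0]

# Hypothetical per-tag weights; real weights would be learned from data.
tag_weights = {
    "perseverance": [1.0, 0.0, 1.0],  # logit = 2.5
    "friendship": [0.0, 1.0, 0.0],    # logit = -1.0
    "honesty": [1.0, 1.0, 0.5],       # logit = 0.5
    "dream": [-1.0, 0.0, 0.0],        # logit = -0.5
}
tag_biases = {tag: 0.0 for tag in tag_weights}

scores = tag_confidences(doc_vector, tag_weights, tag_biases)
predicted = top_k(scores, 3)

gold = {"perseverance", "dream"}  # hypothetical reference topic words
p_at_3 = precision_at_k(predicted, gold, 3)
print(predicted, round(p_at_3, 3))
```

Scoring each tag independently, rather than with a single softmax over all tags, matches the multi-label nature of the task: one essay prompt may call for several topic words at once.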
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1

[References]