Research on Topic-Enhanced Word and Sentence Embedding Models for Text Classification

Published: 2018-12-17 09:02

[Abstract]: In recent years, deep learning has received growing attention in natural language processing, and deep-learning-based neural language models and word and sentence embedding models have been proposed one after another. With their high accuracy and low complexity, these models have been widely studied and applied in both academia and industry. However, embedding models that rely on the distributional hypothesis of language modeling are ill-suited to direct use in tasks such as text classification: text classification needs highly polarized topic features, whereas the original embedding models merely capture linguistic regularities and pay little attention to mining topic information. To make deep-learning-based word and sentence embedding models better suited to text classification, this thesis applies topic enhancement to the original models and proposes topic-enhanced word and sentence embedding models, with the aim of achieving higher text classification performance. Words with opposite semantic polarity may share similar local contexts, and because the original models train a word's distributed embedding from its local context alone, they cannot capture such opposite polarity. This thesis therefore proposes using high-order pure dependence to model the long-range context in word and sentence embedding models, thereby strengthening the sentiment or topic information carried by the distributed embeddings and improving performance on sentiment analysis and topic mining tasks. The high-order pure dependence method has a rigorous theoretical basis guaranteeing that dependencies among words in the long-range context are "pure": the word dependency forms a complete semantic entity, and the joint probability distribution of the words cannot be conditionally factorized (nor, of course, unconditionally factorized). This ensures that a high-order word dependency cannot be decomposed into random co-occurrences of several lower-order dependencies, so high-order pure dependence can effectively model semantically rich, unambiguous topic information. The topic-enhanced models are applied to sentiment analysis and topic mining tasks on standard datasets, where the thesis reports that they outperform all existing models. In a classification project on a Chinese news corpus, the models are compared with bag-of-words and LDA topic model features using both linear and nonlinear classifiers, and the classification results are examined from multiple angles, showing that topic-enhanced word and sentence embedding models are fully competitive with existing mainstream text feature extraction methods.
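The non-decomposability condition mentioned in the abstract can be written out roughly as follows. This is only a sketch of one reading of the abstract: the indicator variables $X_i$, the subsets $A$, $B$, $C$, and the exact form of the two factorizations are notational assumptions introduced here, not the thesis's own definitions.

```latex
% Assumed notation: W = {w_1, ..., w_n} is a set of words in a long-range
% context, and X_i \in \{0, 1\} indicates whether w_i occurs.
% For a subset S \subseteq W, X_S denotes the variables (X_i)_{w_i \in S}.

% Unconditional factorization: W splits into non-empty, disjoint A and B
% (A \cup B = W) whose occurrences are independent.
\[
  P(X_A, X_B) = P(X_A)\, P(X_B)
\]

% Conditional factorization: for some non-empty C \subset W, the remainder
% W \setminus C splits into non-empty, disjoint A and B that are
% independent given C.
\[
  P(X_A, X_B \mid X_C) = P(X_A \mid X_C)\, P(X_B \mid X_C)
\]

% Under this reading, W is a high-order pure dependence when no choice of
% A, B (and C) satisfies either equality, i.e. the joint distribution
% P(X_1, \dots, X_n) admits neither kind of factorization.
```

Read this way, a group of words that merely co-occurs because two independent lower-order groups happen to appear together would satisfy the first equality and so would not count as a pure dependence.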
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1
