Research on Topic Modeling Methods Based on Deep Learning

Published: 2018-04-17 06:13

Topics: neural network topic model + deep learning; Source: Wuhan University, 2017 master's thesis

【Abstract】: Topic models extract semantic information from text and are also an effective way to represent text semantics. Topic modeling not only identifies the latent topical semantic units in a text, but also represents the text in a topic feature space that carries richer semantic information, which benefits tasks such as text classification and clustering, emergent event detection, topic evolution analysis, and recommender systems. However, because of their shallow feature structure and probabilistic generative process, traditional probabilistic topic models still suffer from insufficient scalability, insufficient topic semantic coherence, insufficient contextual consistency in inference, and insufficient feature expressiveness. The growing maturity of deep learning has brought new opportunities to natural language processing and new ways to build topic models. Deep learning methods such as word embedding, knowledge embedding, and neural networks have achieved breakthroughs in representing the semantic features of text, making it possible to build deep, semantically coherent topic models. Topic modeling based on deep learning is nevertheless still in its early stages, and how to combine the two effectively remains an open research problem. This thesis applies deep learning to the traditional text topic modeling process, aiming to build deep topic models with deep semantic representations. The work has three parts. First, it proposes SG_TransE (Skip-Gram with TransE), a word embedding model constrained by a knowledge base, to obtain vector representations of words. SG_TransE combines the Skip-Gram model with the TransE knowledge translation model so that the resulting word embeddings carry knowledge semantics. Second, it proposes DGPU-LDA (Double Generalized Polya Urn with LDA), a probabilistic topic model based on deep semantic reinforcement. On the one hand, the model incorporates DS-Bi-LSTM (Document Semantic Bi-directional LSTM), a bidirectional-LSTM document semantic encoding framework designed in this thesis, to produce an embedded representation of document-level semantics; on the other hand, it uses a dual GPU (generalized Polya urn) semantic reinforcement mechanism over document-topic and word-word relations, together with an LSTM, to characterize the Gibbs sampling process during parameter inference. Finally, it restructures DGPU-LDA as a neural network and proposes NS-LDA (Neural Semantic LDA). NS-LDA likewise incorporates the DS-Bi-LSTM document semantic encoding framework, uses hidden layers to encode the document-topic and topic-word information separately, and then applies a product operation to obtain the score of each word in a document, which serves as the network output. Experiments on the Sogou news dataset and the 20 Newsgroups dataset show that DGPU-LDA and NS-LDA have some advantages over several recent topic models in topic semantic coherence and text classification accuracy, and also indicate the effectiveness of the proposed deep topic models in representing the semantic features of text.
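The abstract describes NS-LDA only at a high level: hidden layers encode document-topic and topic-word information, and a product of the two gives each word's score in a document. The following is a minimal sketch of that scoring step under stated assumptions, not the thesis's implementation. The softmax normalization, the function names `softmax` and `word_scores`, the random toy inputs, and the omission of the DS-Bi-LSTM encoder are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def word_scores(doc_topic_logits, topic_word_logits):
    """Score every vocabulary word for one document.

    doc_topic_logits: K raw activations of a hypothetical document-topic layer.
    topic_word_logits: K rows of V raw activations of a hypothetical
        topic-word layer.
    Returns V scores, where score[w] = sum_k theta[k] * phi[k][w].
    """
    theta = softmax(doc_topic_logits)                  # document-topic weights
    phi = [softmax(row) for row in topic_word_logits]  # topic-word weights
    vocab_size = len(phi[0])
    return [
        sum(theta[k] * phi[k][w] for k in range(len(theta)))
        for w in range(vocab_size)
    ]


random.seed(0)
K, V = 3, 5  # toy sizes (assumed): 3 topics, 5 vocabulary words
doc_logits = [random.gauss(0, 1) for _ in range(K)]
topic_logits = [[random.gauss(0, 1) for _ in range(V)] for _ in range(K)]
scores = word_scores(doc_logits, topic_logits)
print(scores)
```

With both factors normalized this way, the scores for one document form a probability distribution over the vocabulary, which mirrors how LDA mixes topic-word distributions by document-topic proportions. Whether the thesis normalizes its hidden layers in this manner is not stated in the abstract.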
【Degree-granting institution】: Wuhan University
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP391.1

【Similar literature】