Research on Deep Learning-Based Text Representation and Classification Methods

Published: 2018-02-26 08:02

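Keywords: text classification, text representation, deep learning, word vectors  Source: University of Science and Technology Beijing, 2016 doctoral dissertation  Type: degree thesis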

【Abstract】: With the widespread application of information technology and the broad rollout of informatization, text information is growing explosively, and how to obtain useful information from vast resources has become a focus of attention. Extracting and classifying text content will become a key means of solving text information management problems. The cornerstone of text classification is text representation. Traditional text representations are count-based. Such representations assume that words are independent of one another and ignore the semantic information of the text. Moreover, feature selection introduces many human-made factors, and the extracted features are high-dimensional and highly sparse, so they cannot represent text effectively. Meanwhile, the variety of text types and the richness of topics pose new challenges for text classification; in particular, traditional classification methods generalize poorly on sample sets with imbalanced label distributions. Designing new semantics-based text representation and classification algorithms has therefore become a research hotspot. In recent years, deep learning, through its characteristic hierarchical structure, has been able to extract high-level features from low-level (shallow) features, addressing these problems well and providing strong support for extracting effective text representations and building efficient, accurate text classification models. Deep learning has made major progress in image, speech, and natural language processing, showing potential application value. This thesis therefore studies text representation and text classification based on deep learning models and obtains the following results:

1. A multi-class text representation and classification method based on a hybrid deep belief network. For ordinary, regular, multi-class classification tasks (such as news text), the traditional Bag-of-words (BOW) representation suffers from high dimensionality and high sparsity. Based on text keywords, this thesis takes the word-vector representations of the keywords as the text input and combines a Deep Belief Network (DBN) with a Deep Boltzmann Machine (DBM) to design an HDBN (Hybrid Deep Belief Network) model. Experimental results on text classification and text retrieval show that the deep learning model based on word-vector embeddings outperforms traditional methods. In addition, a two-dimensional visualization experiment shows that the high-level text representations extracted by the HDBN model exhibit high cohesion and low coupling.

2. A multi-label abstract text representation and classification method based on a convolutional neural network combined with a deep Boltzmann machine. For multi-label classification of abstract-type text (biomedical abstracts), beyond each document carrying multiple labels, there are domain-specific challenges: the text contains only title and abstract information, and it contains a large number of biomedical terms and abbreviations. This thesis proposes the following solutions. First, it proposes a sequence-based text input representation (Document word Sequence Embedding, DSE). DSE uses Wikipedia and named entities to expand the features of the title and abstract text and embeds word-vector representations, better preserving the contextual semantic information of the text. Second, taking the expanded word vectors as the text input, it proposes a model that uses a CNN (Convolutional Neural Network) to extract local features of the document and a DBM to fuse global features (Boltzmann-Convolutional Neural Network, B-CNN), which better extracts high-level text representations. Finally, using label clustering and label co-occurrence correlations, it constructs a label hierarchy tree and designs an effective hierarchical network to realize the label tree. In addition, the thesis derives the error back-propagation derivative formulas for the B-CNN model, so that the model can be trained and fine-tuned in a supervised manner as a whole. Experimental results show that B-CNN performs well not only on biomedical text but also in other domains.

3. An imbalanced multi-label full-text representation and classification method based on long short-term memory (LSTM). Compared with multi-label abstract classification, multi-label full-text classification faces two new challenges: (1) there are no obvious text keywords, so the text representation must rely on the word vectors of the full text; (2) the sample distribution is imbalanced, which seriously degrades classification performance. This thesis improves on LSTM with a sequence-prediction-based LSTM2 model. First, taking the sequence of word vectors of the document's words as the text input, it uses LSTM to extract global document features from the full text. At the same time, it analyzes the latent relationship between documents and labels to extract local document features. The global and local features together form the high-level text representation, which raises the probability of predicting sparse labels. Then, the thesis uses a Parser to build a semantic label tree, extracts the minimal subtree containing the document's original label set, and takes the sequence produced by traversing that subtree as the document's new label (sequence). An LSTM model is used to learn and predict the label (sequence) of each document. Experimental results show that the LSTM2 algorithm effectively addresses prediction on imbalanced multi-label full text.
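
The label-sequence step in the third contribution (take the smallest subtree of the label tree that contains a document's original labels, then traverse it to obtain a sequence) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state the traversal order or how the tree is stored, so the pre-order traversal, the child-to-parent dictionary, the choice to root the subtree at the labels' lowest common ancestor, and the toy `PARENT` tree with all its label names are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical label tree, stored as child -> parent (the root maps to None).
PARENT = {
    "root": None,
    "disease": "root",
    "chemical": "root",
    "cancer": "disease",
    "infection": "disease",
    "lung_cancer": "cancer",
    "breast_cancer": "cancer",
}


def path_to_root(node, parent):
    """Return [node, parent(node), ..., root]."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def minimal_subtree(labels, parent):
    """Return (root, nodes) of the smallest connected subtree holding all labels."""
    if not labels:
        raise ValueError("at least one label is required")
    paths = [path_to_root(label, parent) for label in labels]
    # Ancestors shared by every label; the deepest one is the lowest common ancestor.
    common = set(paths[0]).intersection(*paths[1:])
    lca = next(node for node in paths[0] if node in common)
    nodes = set()
    for path in paths:
        # Keep only the part of each path from the label up to the common ancestor.
        nodes.update(path[: path.index(lca) + 1])
    return lca, nodes


def label_sequence(labels, parent):
    """Pre-order traversal of the minimal subtree (children visited in sorted order)."""
    lca, nodes = minimal_subtree(labels, parent)
    children = defaultdict(list)
    for node in nodes:
        if node != lca:
            children[parent[node]].append(node)
    sequence, stack = [], [lca]
    while stack:
        node = stack.pop()
        sequence.append(node)
        # Push in reverse so the smallest child name is popped first.
        stack.extend(sorted(children[node], reverse=True))
    return sequence
```

With the toy tree, `label_sequence({"lung_cancer", "infection"}, PARENT)` yields `["disease", "cancer", "lung_cancer", "infection"]`: the intermediate nodes that connect the two labels become part of the target sequence. In the thesis, a sequence of this kind is what the LSTM learns to predict for each document.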

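【Degree-granting institution】: University of Science and Technology Beijing
【Degree level】: Doctoral
【Year degree granted】: 2016
【Classification number】: TP391.1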

【Related literature】