Multi-Label Text Classification Based on Long Short-Term Memory Networks

Published: 2018-09-19 16:48

[Abstract]: Classification has long been a core problem in artificial intelligence. As text content grows richer, the semantics of a text take on multi-faceted, multi-label characteristics, and multi-label text classification has become important for automatically indexing and managing such content. Although text classification has been studied extensively, the complexity of the multi-label problem grows exponentially with the number of labels, so traditional techniques cannot meet the need well. This thesis therefore studies multi-label text classification. The main work is as follows: (1) The thesis analyzes the shortcomings of traditional algorithms and proposes a hierarchical long short-term memory (LSTM) network model based on word vectors, which models text at the sentence level and at the document level to obtain a vector representation of the whole document. (2) Building on this model, the thesis proposes two strategies for multi-label classification of text. The first ranks labels using multinomial logistic regression and then applies dynamic threshold adjustment to obtain the predictions. The second uses the structural relationships among labels to build a label tree, trains multiple classifiers to perform joint prediction on that tree, and proposes several joint prediction criteria. (3) On a New York Times news dataset, the thesis designs several comparative experiments that compare the proposed algorithms with baseline models on multiple metrics. It also designs experiments to examine how different prediction criteria affect model performance during joint prediction on the label tree.
The main contributions of this thesis are as follows: (1) Combining word-vector features with text-structure features, it proposes a hierarchical LSTM network that learns a vector representation of a document, and it combines multinomial logistic regression with least-squares-based dynamic threshold adjustment to rank and predict labels. Experiments show that this strategy improves multi-label classification substantially over the baseline models (subset accuracy increases by 38% and F1 score by 23%). (2) It uses the structural relationships among labels to build a label tree and trains a classifier for every internal node. The outputs of these classifiers define several schemes for weighting the edges of the tree, and A* search for the shortest path on the weighted label tree then implements the different joint prediction criteria. Experiments show that this strategy yields a further clear improvement over the previous model (subset accuracy increases by 12% and F1 score by 2.5%).
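The label-tree search described in contribution (2) can be illustrated with a minimal sketch. The abstract does not give the edge-weighting schemes or the A* heuristic, so everything specific here is an assumption: each edge is weighted by the negative log of a hypothetical per-node classifier probability, the heuristic is zero (in which case A* reduces to uniform-cost search), and the function name `best_label_path` and the toy tree are invented for illustration. The sketch returns only the single most probable root-to-leaf path; how the thesis turns paths into label sets is not stated.

```python
import heapq
import math


def best_label_path(children, edge_prob, root, heuristic=lambda node: 0.0):
    """A* search for the cheapest root-to-leaf path in a label tree.

    children:  dict mapping each internal node to a list of its child nodes
    edge_prob: dict mapping (parent, child) to an assumed classifier probability
    Edge cost is -log(probability), so the cheapest path is the most probable.
    The default zero heuristic is admissible and makes this uniform-cost search.
    """
    # Frontier entries: (cost so far + heuristic, cost so far, node, path).
    frontier = [(heuristic(root), 0.0, root, [root])]
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if not children.get(node):
            # The first leaf popped from the priority queue is optimal.
            return path, math.exp(-cost)
        for child in children[node]:
            p = edge_prob[(node, child)]
            if p <= 0.0:
                continue  # an impossible branch has infinite cost; skip it
            new_cost = cost - math.log(p)
            heapq.heappush(
                frontier,
                (new_cost + heuristic(child), new_cost, child, path + [child]),
            )
    return None, 0.0


# Hypothetical toy tree: a greedy top-down choice would follow "news" (0.55),
# but the jointly most probable path goes through "sports".
children = {
    "root": ["news", "sports"],
    "news": ["politics", "business"],
    "sports": ["soccer", "tennis"],
}
edge_prob = {
    ("root", "news"): 0.55, ("root", "sports"): 0.45,
    ("news", "politics"): 0.5, ("news", "business"): 0.5,
    ("sports", "soccer"): 0.9, ("sports", "tennis"): 0.1,
}
path, prob = best_label_path(children, edge_prob, "root")
print(path)  # → ['root', 'sports', 'soccer']
```

The toy example shows why a joint search over the tree can differ from picking the best child at each node independently: the path through "sports" has probability 0.45 × 0.9 = 0.405, higher than either path through "news" (0.275).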
【Degree-granting institution】: Zhejiang University
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: TP391.1
