Online Supervised Topic Model Based on Stochastic Variational Inference and Its Parallel Implementation

Published: 2018-03-17 02:28

Topic: supervised topic models　Focus: MapReduce　Source: Jilin University, 2017 master's thesis　Document type: degree thesis

[Abstract]: In machine learning research, topic models and supervised topic models are general-purpose models for analyzing natural language. Such models use probability distributions to reveal the internal structure of text and present it in the form of "topic structure" and "labels". Supervised topic models are widely used in text analysis, public opinion monitoring, and e-commerce, and have therefore become an active research topic in machine learning. However, sLDA, a commonly used supervised topic model, is trained with a learning algorithm that nests a coordinate ascent algorithm inside a variational EM algorithm. As the amount of data grows, the stacking of these two iterative optimization procedures makes the training time of sLDA grow exponentially. In addition, the learning algorithm of sLDA is an offline training algorithm, which makes it ill-suited to application scenarios with strict real-time requirements and large data volumes, such as text classification and public opinion monitoring. These problems have severely constrained the development of supervised topic models. To address them, this thesis makes the following contributions.

1. An efficient online learning algorithm for supervised topic models is proposed. The thesis uses the idea of stochastic variational inference to improve the learning algorithm of sLDA. Based on the theory that the natural gradient in Riemannian space points more accurately toward the maximum likelihood, the Euclidean gradient in the sLDA learning algorithm is replaced with the natural gradient, which speeds up convergence. In addition, following the idea of stochastic optimization, each iteration randomly samples a subset of the training data to estimate the gradient of the global parameters. This reduces the computational burden of the model and also gives sLDA the ability to learn online.

2. A parallel learning algorithm for the online supervised topic model is proposed, with support for multiple application scenarios. Because the number of documents sampled in each iteration affects the label prediction results, the training algorithm must allow the per-iteration sample size to be set flexibly. The thesis adopts the popular MapReduce parallel computing framework to process the online supervised topic model in a distributed manner, so that it can be applied to large-scale data. In addition, taking advantage of the flexibility of Python and mrjob, versions of the algorithm are implemented for single-machine single-process, single-machine multi-process, distributed, and cloud computing settings, further broadening its range of application.
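The two ideas in the abstract can be illustrated together with a minimal sketch. It is not the thesis's algorithm: it assumes the generic stochastic variational inference form, in which a natural-gradient step on a global parameter is a weighted average of the old value and a mini-batch estimate, with a decaying step size rho_t = (tau0 + t)^(-kappa). The function names, the toy corpus, the hyperparameters `tau0`, `kappa`, and `eta`, and in particular the local step inside `map_doc` are all hypothetical placeholders. The per-document statistics are produced by a map step and summed by a reduce step, which is the part a MapReduce framework would distribute.

```python
import random
from functools import reduce


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** (-kappa)."""
    return (tau0 + t) ** (-kappa)


def map_doc(doc_counts, lam):
    """Map step: expected topic-word counts (K x V) for one document.

    The responsibilities used here are a crude stand-in (proportional to
    the current column of lam). A real sLDA local step would also use the
    document's label and per-document topic proportions.
    """
    K, V = len(lam), len(lam[0])
    stats = [[0.0] * V for _ in range(K)]
    for w, n in doc_counts.items():
        col = [lam[k][w] for k in range(K)]
        total = sum(col)
        for k in range(K):
            stats[k][w] = n * col[k] / total
    return stats


def add_stats(a, b):
    """Reduce step: element-wise sum of two K x V statistic matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def svi_update(lam, batch, num_docs, t, eta=0.01):
    """One stochastic natural-gradient step on the global parameter lam."""
    summed = reduce(add_stats, (map_doc(d, lam) for d in batch))
    scale = num_docs / len(batch)  # rescale the mini-batch to corpus size
    rho = step_size(t)
    K, V = len(lam), len(lam[0])
    return [[(1 - rho) * lam[k][w] + rho * (eta + scale * summed[k][w])
             for w in range(V)] for k in range(K)]


# Toy corpus: each document maps word id -> count (hypothetical data).
corpus = [{0: 2, 1: 1}, {2: 3}, {1: 1, 3: 2}, {0: 1, 3: 1}]
rng = random.Random(0)
K, V = 2, 4
lam = [[rng.uniform(0.5, 1.5) for _ in range(V)] for _ in range(K)]
for t in range(5):
    batch = rng.sample(corpus, 2)  # mini-batch size is a free setting
    lam = svi_update(lam, batch, num_docs=len(corpus), t=t)
```

The mini-batch size (2 in the toy loop) is kept as an explicit argument because the abstract notes that the number of documents sampled per iteration affects label prediction. In a distributed setting, `map_doc` and `add_stats` would correspond to the mapper and reducer, while the update in `svi_update` would run once per iteration on the aggregated statistics.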
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1;TP181

[Related literature]