Research and Application of Online Learning Algorithms

Published: 2018-03-05 13:14

Topic: online learning   Focus: time series   Source: Zhejiang University, 2017 doctoral dissertation   Document type: degree thesis

[Abstract]: With the rapid development of information technology and the growing popularity of Internet applications, data are being generated faster and faster. Traditional offline learning algorithms, which are built around batch data processing, cannot cope with streaming data in big-data scenarios. Online learning algorithms receive data continuously and update the model dynamically in real time, which makes them well suited to large-scale and streaming data. They have attracted considerable attention from researchers and are one of the current hot topics in machine learning. Research on online learning algorithms covers three main aspects: (1) theoretical analysis of online learning algorithms; (2) application of online learning algorithms to different machine learning tasks; (3) convergence rates of online learning algorithms. Addressing these problems, this thesis studies online learning algorithms fairly systematically, from theoretical analysis to concrete applications. On the one hand it remedies shortcomings of existing algorithms, and on the other it proposes new solutions to several open problems. Specifically, the contributions of this thesis are as follows:

1. ADMM (Alternating Direction Method of Multipliers) is a general optimization framework that is widely used across tasks in distributed machine learning. To accelerate online ADMM, the thesis extends the traditional regret analysis of online ADMM from a round-based analysis to one based on gradient variation. For two types of online ADMM learning algorithms (FTRL-ADMM and PGD-ADMM), it proposes improved online ADMM algorithms, gives regret analyses based on gradient variation, and proves that the proposed algorithms have tighter regret upper bounds than existing ones.

2. The ARIMA (autoregressive integrated moving average) model is a linear model widely used in time series prediction. However, existing learning algorithms for ARIMA models are all offline, and the noise terms must satisfy strict assumptions. This severely limits the generality of the ARIMA model and its use on massive time series prediction problems. The thesis therefore relaxes the assumptions on the ARIMA noise terms and proposes online learning algorithms for the ARIMA model. Theoretical analysis shows that the proposed online algorithm approaches the performance of the best offline ARIMA learning algorithm. A series of experiments on synthetic and real datasets demonstrate the efficiency and effectiveness of the proposed algorithm.

3. In recent years the NN-PA algorithm, which solves non-negative matrix factorization through online learning, has been very successful in recommender systems. To speed up the convergence of NN-PA, the thesis proposes the NN-APA algorithm, which uses second-order gradient information in each round's update and uses an "expert learning" technique to tune the parameters of the online learning task automatically. The thesis gives a theoretical analysis of the new algorithm and proves that it converges faster than NN-PA. In-depth experiments on a series of recommender-system datasets further verify the efficiency and effectiveness of the new algorithm.

4. The Collaborative Topic Regression (CTR) model combines probabilistic matrix factorization (PMF) with topic modeling (for example, LDA), using text information to improve recommendation accuracy. Although the model has been very successful in recommendation, bdi-CTR, the existing inference algorithm for the CTR model, has serious shortcomings. First, bdi-CTR is an offline algorithm and cannot adapt to streaming data or real-world big-data scenarios. Second, bdi-CTR first uses LDA to compute the topic representations of items and then feeds that result into the PMF solver. It ignores the effect of PMF on LDA; that is, it does not consider how rating-prediction information could inform LDA topic inference. The thesis therefore proposes obi-CTR, an online joint inference algorithm. The proposed algorithm not only handles streaming data but also uses the results of the PMF model to strengthen LDA inference, so that the two models reinforce each other and are optimized jointly. Experimental results show that obi-CTR handles streaming and massive data efficiently and improves both the topic representations of the topic model and the prediction performance of the recommender system.
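The online ARIMA idea in contribution 2 can be illustrated with a minimal sketch. It is not the algorithm of the thesis: it assumes that the ARIMA model is approximated by a pure autoregressive model on the d-times differenced series, and that the weights are updated by plain online gradient descent on the squared loss. The function name `online_ar_ogd` and the parameters `order`, `d`, and `lr` are hypothetical and chosen only for illustration.

```python
import numpy as np


def online_ar_ogd(series, order=2, d=1, lr=0.005):
    """One-step-ahead prediction with an AR model trained online.

    Illustrative assumption: the series is differenced d times, and the
    differenced value z[t] is predicted as a weighted sum of the previous
    `order` differenced values. Predictions are on the differenced scale.
    """
    x = np.asarray(series, dtype=float)
    z = np.diff(x, n=d) if d > 0 else x.copy()
    if len(z) <= order:
        raise ValueError("series is too short for the requested order and d")

    w = np.zeros(order)                  # AR weights, most recent lag first
    preds = np.empty(len(z) - order)
    for i, t in enumerate(range(order, len(z))):
        lags = z[t - order:t][::-1]      # z[t-1], z[t-2], ...
        pred = float(w @ lags)           # predict before seeing z[t]
        preds[i] = pred
        # Gradient of (pred - z[t])**2 with respect to w, then one OGD step.
        w -= lr * 2.0 * (pred - z[t]) * lags
    return w, preds, z[order:]


if __name__ == "__main__":
    # Synthetic data: the first difference follows an AR(1) process.
    rng = np.random.default_rng(0)
    z = np.zeros(2000)
    for t in range(1, len(z)):
        z[t] = 0.6 * z[t - 1] + rng.normal()
    x = np.cumsum(z)

    w, preds, targets = online_ar_ogd(x)
    tail = slice(-500, None)
    print("online MSE (last 500):", np.mean((preds[tail] - targets[tail]) ** 2))
    print("zero-predictor MSE   :", np.mean(targets[tail] ** 2))
```

Each observation is used once and then discarded, so memory stays constant in the length of the stream. That is the property which makes this style of update suitable for streaming time series.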

[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: TP181
