非标准Multi-armed bandit的随机调度

发布时间：2018-03-18 09:21

本文选题：最优停时　切入点：允许停时　出处：《华东师范大学》2016年博士论文　论文类型：学位论文

【摘要】：本文的主要目的是拓展具有指数策略的multi-armed bandit (MAB)随机调度模型,使之更符合复杂的现实背景：(1)诸arm具有不同的切换限制；(2)诸arm具有不同的折现率；(3)机器随机中断引起的不完全信息。为此,本文的另一个目的是研究带限制的最优停时问题和非参贝叶斯,使之适用于上述非标准的MAB。在随机变量集合的层面上,在带限制的停时类范围内,讨论最优停时问题,运用经典的概率理论给出一般结论。这理论涵盖离散时间、连续时间、半马氏框架下所得的经典结果。大致分三个阶段：在第一阶段在单指标的随机变量集的框架下展开,首先引入允许停时类的概念,建立带限制的最优停时模型,讨论两类价值族和最优停时的性质；接着建构最优停时存在的充分条件,进而讨论价值变量族的局部性质、正则性等。在第二阶段,把最优停时问题拓展到双指标容许随机变量类上,研究最优双停时的性质,所得结果自然可推广到多指标的情形。第三阶段,讨论第一阶段中的可及集,证明了可及集的可列停时分解的性质。在连续时间的随机MAB模型中,考虑了相互独立的arm均有自身允许的停止范围,且只有在该范围上才能切换,目标是最大化在无限时间上的期望总折扣报酬。首先,引入允许停止随机集的概念,建立过程版的带停止限制的最优停时一般理论；接着,基于EL Karoui and Karatzas (1994)的想法,运用所得的理论解决单arm的报酬过程与Gittins指标过程的关系,最后,运用Kaspi and Mandelbaum (1998)的偏移法(excursion method)证明Gittins指标的最优性,其中的论证过程也比以往的证明简洁。在连续时间的随机MAB模型中,同时了考虑arm的切换要求和变折现的情况。分别采用两种期望总折扣报酬,运用带限制的最优停时理论,导出相应的指数定义,运用偏移法,证明了其一指标为最优策略,而另一却不是。运用贝叶斯方法把带随机中断的调度问题转化为不完全信息的调度问题,选择期望折扣报酬为目标函数,分别在静态策略、动态策略下讨论最优指数策略特点,尤其是动态策略中的一步报酬率的情况,目的是想了解不同的贝叶斯框架对调度策略的影响。在静态策略下,采用一般框架与参数框架所得的结论基本相似；而就动态策略而言,通过分析两个例子的一步报酬率与贝叶斯框架的之间的关系,以此说明不同的贝叶斯结构对调度的影响。
[Abstract]:The main purpose of this paper is to extend the multi-armed bandit mabs stochastic scheduling model with exponential policy. Make it more in line with the complex realistic background: 1) the arm has different handoff restrictions / 2) and the arm has different discount rate / / 3) the incomplete information caused by the random interruption of the machine. Another purpose of this paper is to study the optimal stopping time problem with constraints and non-parametric Bayes, so that it can be applied to the above mentioned non-standard MAB.The optimal stopping time problem is discussed on the level of random variable set and within the stopping time class with constraints. A general conclusion is given by using the classical probability theory. This theory covers the classical results of discrete time, continuous time and semi-Markov frame. It is roughly divided into three stages: in the first stage, the results are expanded under the framework of a single index random variable set. Firstly, the concept of allowable stopping class is introduced, and a constrained optimal stopping time model is established to discuss the properties of two classes of value family and optimal stopping time, then the sufficient conditions for the existence of optimal stopping time are constructed, and then the local properties of the family of value variables are discussed. In the second stage, the optimal stopping time problem is extended to the class of two-parameter admissible random variables, and the properties of the optimal double stopping time are studied. In this paper, we discuss the reachability set in the first stage, and prove the property of the countable stopping time decomposition of the reachable set. In the continuous time stochastic MAB model, we consider that each independent arm has its own allowable stop range, and only in this range can we switch. The goal is to maximize the expected total discounted return in infinite time. Firstly, the concept of allowing stopping random sets is introduced, and the general theory of optimal stopping time with stop limit is established. Then, based on the idea of El Karoui and Karatzas 1994), The obtained theory is used to solve the relationship between the return process of a single arm and the Gittins index process. Finally, the excursion method of Kaspi and Mandelbaum 1998) is used to prove the optimality of the Gittins index. In the stochastic MAB model with continuous time, the switching requirements of arm and the case of variable discounting are taken into account. Two kinds of expected total discounted returns are adopted, and the optimal stopping time theory with restrictions is used. The corresponding exponential definition is derived, and the migration method is used to prove that one index is the optimal strategy while the other is not. The Bayesian method is used to transform the scheduling problem with random interruption into a scheduling problem with incomplete information. Choosing the expected discount return as the objective function, we discuss the characteristics of the optimal exponential strategy under static and dynamic strategies, especially the one-step return rate in the dynamic strategy. The purpose of this paper is to understand the influence of different Bayesian frameworks on scheduling policies. In static policies, the conclusions obtained by using general frameworks and parameter frameworks are basically similar. By analyzing the relationship between the one-step rate of return and the Bayesian framework of two examples, the influence of different Bayesian structures on scheduling is illustrated.
【学位授予单位】：华东师范大学
【学位级别】：博士
【学位授予年份】：2016
【分类号】：O212.8;F224

【相似文献】