A Study of Semiparametric Models under Biased Data

Published: 2018-05-25 22:14

Topic: left truncation + right censoring; Source: University of Science and Technology of China, 2015 doctoral dissertation

[Abstract]: Survival analysis has become one of the most important areas of biostatistics, and it also has important applications in other fields, including reliability theory, actuarial science, demography, epidemiology, sociology, and economics. Because of the complexity of sampling, most real data are biased. Common examples are censored data and truncated data, both of which can be regarded as general biased data. Biased data also arise in many other fields, such as biomedicine, sociology, economics, and quality control. Data are biased when the probability that an individual is sampled depends on its own value, so that individuals are sampled with unequal probabilities. This is an interesting sampling problem because it favors some individuals and neglects others. When the collected data are biased, the statistical inference procedures developed for simple data no longer apply, and methods designed for biased data are needed. This thesis uses estimating equations to study semiparametric models under general biased data, because a semiparametric model contains both finite-dimensional parameters that are easy to interpret and infinite-dimensional unknown functions that add flexibility to the model.
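As a small illustration of why biased sampling matters, the following sketch simulates the special case of length-biased sampling, in which the probability of sampling an individual is proportional to its value. The exponential population, the sample sizes, and the function name `length_biased_sample` are assumptions made for illustration and do not come from the thesis.

```python
import random
from statistics import fmean


def length_biased_sample(population, size, rng):
    """Draw `size` values, each with probability proportional to its value."""
    return rng.choices(population, weights=population, k=size)


rng = random.Random(0)

# Hypothetical population of survival times: exponential with mean 1.
population = [rng.expovariate(1.0) for _ in range(200_000)]

# Sample that favors long survival times.
biased = length_biased_sample(population, 50_000, rng)

pop_mean = fmean(population)
biased_mean = fmean(biased)

# For an exponential population the length-biased mean is E[T^2] / E[T],
# which is twice the population mean.
print(f"population mean:    {pop_mean:.3f}")
print(f"length-biased mean: {biased_mean:.3f}")
```

The naive mean of the biased sample overstates the population mean by roughly a factor of two in this setting, which is why inference procedures built for simple random samples cannot be applied directly.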
In Chapter 1, we first introduce the types of biased data to be studied: censored data, length-biased data, and data collected under a case-cohort design. We then introduce several semiparametric models that are common in survival analysis: the Cox model, the additive hazards model, the semiparametric linear transformation model, the quantile regression model, and the proportional mean residual life model.
In Chapter 2, we use an important property of length-biased data, namely that the truncation time and the residual time after enrollment have the same distribution (Huang and Qin, 2011, 2012), to construct a composite estimator under the additive hazards model. The resulting estimator is about twice as efficient as the original left-truncated right-censored estimator. Together with Cheng and Huang (2014), we were among the first, almost simultaneously, to use the concept of a composite estimating equation. The large-sample properties of the estimator and simulation results in finite samples are presented in this chapter. We also apply the proposed method to the Channing House data from the United States and find that it performs well.
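The equal-distribution property can be seen from a short derivation. The notation here is introduced for illustration and is an assumption, not necessarily the notation of the thesis: T is the failure time in the target population with density f, survival function S, and finite mean μ; A is the truncation time (onset to enrollment); V is the residual time (enrollment to failure). Under length-biased sampling with a stationary onset process, the joint density of (A, V) is symmetric in its two arguments, so the two marginals coincide:

```latex
% Assumed notation: f = density of T, S = survival function of T, \mu = E[T].
\begin{align*}
f_{A,V}(a,v) &= \frac{f(a+v)}{\mu}, \qquad a>0,\ v>0, \\
f_A(a) &= \int_0^\infty \frac{f(a+v)}{\mu}\,dv = \frac{S(a)}{\mu}, \\
f_V(v) &= \int_0^\infty \frac{f(a+v)}{\mu}\,da = \frac{S(v)}{\mu}.
\end{align*}
```

Because A and V share the same marginal distribution, the observed truncation times carry information about the failure-time distribution in addition to the residual times, which is the kind of information a composite estimating equation can exploit.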
In Chapter 3, we use the martingale structure of general left-truncated right-censored data and the property of length-biased data introduced in Chapter 2 to propose a simple estimating equation method and a composite estimating equation method for the quantile regression model under length-biased data. Our methods do not require estimating the distribution of the censoring variable and are therefore less complex than those of Chen and Zhou (2012) and Wang and Wang (2014). We establish asymptotic properties, including uniform consistency and weak convergence, using empirical process and stochastic integral techniques. As in Peng and Huang (2008), a simple algorithm is obtained by minimizing a sequence of L1-type convex functions, so the new estimators can be computed with existing functions in R. Because the limiting variance involves an unknown density function, whose estimate is very unstable in finite samples, we estimate the variance by extending the method of Jin et al. (2001). Finally, we apply the proposed methods to the Channing House data from the United States.
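To make "minimizing an L1-type convex function" concrete, here is a minimal sketch for the simplest possible case: complete (uncensored, untruncated) data with no covariates, where minimizing the quantile check loss recovers a sample quantile. It is not the estimator proposed in the thesis, which additionally handles length bias and censoring. The function names and the simulated data are assumptions for illustration.

```python
import random


def check_loss(u, tau):
    """Quantile check function: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(q, ys, tau):
    """Sum of check losses of the residuals y - q."""
    return sum(check_loss(y - q, tau) for y in ys)


def quantile_by_l1(ys, tau):
    """Minimize the loss over q.

    The loss is convex and piecewise linear in q with kinks at the data
    points, so a minimizer can always be found among the data points.
    """
    return min(ys, key=lambda q: total_loss(q, ys, tau))


rng = random.Random(1)
ys = [rng.gauss(0.0, 1.0) for _ in range(501)]  # hypothetical complete data
tau = 0.25

q_hat = quantile_by_l1(ys, tau)
frac_below = sum(y <= q_hat for y in ys) / len(ys)
print(f"estimated {tau}-quantile: {q_hat:.3f}, fraction of data <= estimate: {frac_below:.3f}")
```

The brute-force search is quadratic in the sample size and is used only for clarity. With covariates the same loss becomes a linear program, which is what existing quantile regression routines solve.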
In Chapter 4, we study the proportional mean residual life model for censored data under a case-cohort design. The work is motivated by real data from a nickel refinery in South Wales, where we want to know how much longer a worker can be expected to live given the available covariates. Because the incidence rate in this study is low, a case-cohort design is preferred. We propose weighted estimating equations to estimate the regression parameters and the baseline mean residual life function, and give the large-sample properties of the proposed estimators. We then present simulation results to examine the finite-sample performance of the proposed method. Finally, we illustrate the method by analyzing the South Wales nickel refinery data mentioned above.
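The idea behind weighting in a case-cohort design can be sketched with a toy simulation. The abstract does not state the weights used in the thesis, so the scheme below (weight 1 for cases, inverse subcohort sampling probability for sampled non-cases) is one common choice assumed here for illustration, as are the cohort size, event rates, and variable names.

```python
import random
from statistics import fmean

rng = random.Random(2)
n = 100_000
p_sub = 0.1  # hypothetical subcohort sampling probability

# Hypothetical full cohort: covariate z and a rare event that is more
# likely when z is positive.
cohort = []
for _ in range(n):
    z = rng.gauss(0.0, 1.0)
    is_case = rng.random() < (0.06 if z > 0 else 0.02)
    cohort.append((z, is_case))

# Case-cohort sample: a random subcohort plus every case.
sampled = []
for z, is_case in cohort:
    in_subcohort = rng.random() < p_sub
    if is_case or in_subcohort:
        # Cases are always included (weight 1); non-cases are included
        # with probability p_sub (weight 1 / p_sub).
        weight = 1.0 if is_case else 1.0 / p_sub
        sampled.append((z, weight))

full_mean = fmean(z for z, _ in cohort)
naive_mean = fmean(z for z, _ in sampled)
ipw_mean = sum(w * z for z, w in sampled) / sum(w for _, w in sampled)

print(f"full cohort mean of z:   {full_mean:.3f}")
print(f"unweighted sample mean:  {naive_mean:.3f}")
print(f"weighted sample mean:    {ipw_mean:.3f}")
```

Because cases are over-represented in the sample, the unweighted mean is pulled toward the covariate values of cases, while the weighted mean stays close to the full-cohort value even though covariates are used for only a fraction of the cohort.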
In Chapter 5, we study the Cox model for length-biased data under a case-cohort design. Inspired by the pseudo-likelihood method of Self and Prentice (1988) and the composite partial likelihood method of Huang and Qin (2012), we propose a simple composite pseudo-partial-likelihood method. Using empirical process theory and convergence results for sampling without replacement, we also give the large-sample properties of the maximum composite pseudo-likelihood estimator and of the corresponding cumulative hazard function estimator under the case-cohort design. We also present simulation results and illustrate the proposed estimation method with the Oscar data.
In Chapter 6, we discuss the semiparametric linear transformation model for length-biased data under a case-cohort design. Lu and Tsiatis (2006) used a martingale integral representation and inverse probability weighting to handle the semiparametric linear transformation model for right-censored data under a case-cohort design. Although the martingale integral representation can also handle left truncation, the resulting estimator is not fully efficient under length-biased sampling. We therefore continue to use the property of length-biased data described in Chapter 2, together with inverse probability weighting, to construct composite estimating equations. The resulting estimating equations can be solved by a simple iterative algorithm to estimate the regression parameters and the unknown transformation function. We give the asymptotic distributions of the proposed estimators and their proofs. Simulation results and a real-data example are used to examine the finite-sample performance of the proposed regression parameter estimator.
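[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2015
[Classification number]: O212.1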

[References]