Statistical Inference for Zero-Inflated, Overdispersed Count Data

Published: 2018-05-02 00:12

Topic: zero inflation + ZIP model; Source: Kunming University of Science and Technology, 2011 master's thesis

[Abstract]: Count data are a type of discrete data that arise widely in daily life and research. Such data are usually analyzed by regression based on the ordinary Poisson distribution, an approach that has been widely applied in past practice and research.
However, count data with far more zeros than the ordinary Poisson distribution allows are also frequently encountered in daily life and research. If such data are still fitted with the ordinary Poisson distribution, the result is heavily biased parameter estimates and incorrect inference. To address this problem, the zero-inflated Poisson (ZIP) mixture regression model, which mixes the ordinary Poisson distribution with a distribution degenerate at zero, has been proposed. Deciding whether the count data under study are in fact zero-inflated is decisive for model selection. To that end, this thesis proposes a Score test for judging whether zero inflation is present. If it is, the ZIP model is used for regression analysis; otherwise, the traditional and comparatively simple ordinary Poisson distribution can continue to be used.
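The ZIP mixture and a score test for zero inflation can be illustrated with a minimal Python sketch. It assumes the simplest setting, with no covariates and a single Poisson mean, and uses the classical intercept-only score statistic from the zero-inflation literature; it is not necessarily the statistic derived in the thesis. The function names `zip_pmf` and `zero_inflation_score_test` are hypothetical.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) under ZIP: a structural zero with probability pi,
    otherwise an ordinary Poisson(lam) count."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return pi * (y == 0) + (1.0 - pi) * poisson


def zero_inflation_score_test(counts):
    """Score test of H0: pi = 0 (plain Poisson) against zero inflation.

    Assumption: intercept-only model, so the Poisson MLE under H0 is the
    sample mean. The statistic is compared with a chi-square(1) law.
    """
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("all counts are zero; the statistic is undefined")
    p0 = math.exp(-mean)                       # fitted P(Y = 0) under H0
    n0 = sum(1 for y in counts if y == 0)      # observed number of zeros
    stat = (n0 - n * p0) ** 2 / (n * p0 * (1.0 - p0) - n * mean * p0 ** 2)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) upper tail
    return stat, p_value


if __name__ == "__main__":
    # Illustrative data only: 60 zeros out of 100 observations.
    sample = [0] * 60 + [1] * 10 + [2] * 15 + [3] * 10 + [4] * 5
    print(zero_inflation_score_test(sample))
```

A small p-value points toward the ZIP model; a large one gives no reason to abandon the ordinary Poisson model. A score test only requires fitting the simpler null model, which is why it is convenient as a screening step before model selection.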
In addition, for ordinary count data, correlation and hierarchical structure may exist among observations because of, for example, longitudinal data collection mechanisms. In that case an ordinary single-level model cannot yield satisfactory parameter estimates and test results, and multilevel regression models for such hierarchically structured data have been proposed. Focusing on two-level data, the most common type of hierarchically structured data, this thesis uses Bayesian methods for parameter estimation and hypothesis testing.
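For orientation, one common way to write a two-level ZIP model is with a group-level random intercept, as sketched below. The notation ($\pi_{ij}$, $\lambda_{ij}$, $\beta$, $u_j$, $\sigma_u^2$) and the random-intercept form are assumptions made for illustration; the abstract does not state the exact specification used in the thesis.

```latex
% j = 1..J indexes groups (level 2); i = 1..n_j indexes observations in group j (level 1)
P(Y_{ij} = y) =
\begin{cases}
  \pi_{ij} + (1 - \pi_{ij})\, e^{-\lambda_{ij}}, & y = 0, \\[4pt]
  (1 - \pi_{ij})\, \dfrac{e^{-\lambda_{ij}} \lambda_{ij}^{y}}{y!}, & y = 1, 2, \dots
\end{cases}
\qquad
\log \lambda_{ij} = x_{ij}^{\top} \beta + u_j, \quad
u_j \sim N(0, \sigma_u^2)
```

In a Bayesian treatment, priors would be placed on $\beta$ and $\sigma_u^2$, and inference would be based on the joint posterior of the parameters and the random effects $u_j$.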
Beyond excess zeros, the nonzero part of the count data may also show a variance that departs substantially from the mean relative to the ordinary Poisson distribution, that is, overdispersion. In that case the ordinary ZIP model no longer gives the best fit. Because the negative binomial (NB) distribution, which carries a dispersion parameter, can account more fully for the excess dispersion, the zero-inflated negative binomial (ZINB) mixture regression model can be used to achieve the best fit. Before selecting a model, it is also essential to test whether the data are overdispersed. For this purpose, the thesis proposes a Score test for overdispersion in such data in the two-level setting. If the result shows no overdispersion, the ZIP model can be used for regression analysis; otherwise, the ZINB model should be chosen.
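To make the role of the dispersion parameter concrete, the sketch below evaluates the NB and ZINB probability mass functions and checks their moments numerically. It assumes the common mean-dispersion (NB2) parameterization, in which the variance is `mu + alpha * mu**2`; the thesis may use a different parameterization, and the function names are hypothetical.

```python
import math


def nb_pmf(y, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha > 0 (NB2 form).

    As alpha approaches 0 the distribution approaches Poisson(mu).
    """
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)


def zinb_pmf(y, mu, alpha, pi):
    """ZINB pmf: a structural zero with probability pi, otherwise NB(mu, alpha)."""
    return pi * (y == 0) + (1.0 - pi) * nb_pmf(y, mu, alpha)


def moments(pmf, upper=500):
    """Mean and variance of a pmf on 0..upper-1 (truncated numerical sum)."""
    mean = sum(y * pmf(y) for y in range(upper))
    var = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, var


if __name__ == "__main__":
    mu, alpha, pi = 3.0, 0.5, 0.2
    print(moments(lambda y: nb_pmf(y, mu, alpha)))        # variance exceeds mean
    print(moments(lambda y: zinb_pmf(y, mu, alpha, pi)))  # extra zeros add more spread
```

With `mu = 3` and `alpha = 0.5`, the NB variance is `3 + 0.5 * 9 = 7.5`, more than twice the mean. A Poisson model, whose variance must equal its mean, cannot reproduce that spread.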
Missing data are often encountered in real life and research, and they cause many difficulties for parameter estimation and model inference. Many methods for handling missing data have been developed, but all rest on the missing-at-random assumption and treat the covariates as following a common multivariate distribution. In fact, much missingness is caused by measurements falling outside the measurable range or by other non-random factors, that is, the data are missing not at random. Traditional missing-data methods are no longer suitable for such data. This thesis refines the traditional approach for this case: the missing data are treated as unknown parameters, and Gibbs sampling together with a data decomposition technique is used to impute the missing values; the method is then applied to the models studied. Simulation results show that, for data missing not at random, this method is clearly better than traditional methods based on the missing-at-random assumption.
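The idea of treating missing values as unknown parameters inside a Gibbs sampler can be sketched on a deliberately simple case. The example assumes plain Poisson counts with a conjugate Gamma prior, and a missingness mechanism in which any count above a known measurement limit is unrecorded. These choices, and the names `sample_poisson_above` and `gibbs_censored_poisson`, are illustrative assumptions and not the algorithm of the thesis.

```python
import math
import random


def sample_poisson_above(lam, limit, rng):
    """Draw from Poisson(lam) conditioned on Y > limit by inverse-CDF search."""
    pmf = math.exp(-lam)                  # P(Y = 0)
    cdf = pmf
    for y in range(1, limit + 1):         # accumulate P(Y <= limit)
        pmf *= lam / y
        cdf += pmf
    target = rng.random() * (1.0 - cdf)   # uniform position within the tail mass
    y, acc = limit, 0.0
    while True:
        y += 1
        pmf *= lam / y                    # now P(Y = y)
        acc += pmf
        if acc >= target or pmf < 1e-300:  # second test guards against rounding
            return y


def gibbs_censored_poisson(observed, n_missing, limit, n_iter=2000,
                           burn_in=500, prior_shape=1.0, prior_rate=0.1, seed=0):
    """Posterior mean of the Poisson rate when counts above `limit` are missing."""
    rng = random.Random(seed)
    n = len(observed) + n_missing
    sum_obs = sum(observed)
    lam = max(sum_obs / len(observed), 0.1)   # start from the naive mean
    draws = []
    for it in range(n_iter):
        # Step 1: impute each missing count given lam; it is known to exceed limit.
        imputed = [sample_poisson_above(lam, limit, rng) for _ in range(n_missing)]
        # Step 2: draw lam given the completed data (Gamma-Poisson conjugacy).
        shape = prior_shape + sum_obs + sum(imputed)
        rate = prior_rate + n
        lam = rng.gammavariate(shape, 1.0 / rate)  # second argument is the scale
        if it >= burn_in:
            draws.append(lam)
    return sum(draws) / len(draws)


if __name__ == "__main__":
    # Illustrative data only: 178 recorded counts, 22 lost above the limit of 6.
    recorded = [0]*4 + [1]*15 + [2]*29 + [3]*39 + [4]*39 + [5]*31 + [6]*21
    print(sum(recorded) / len(recorded))            # naive mean, biased downward
    print(gibbs_censored_poisson(recorded, 22, 6))  # accounts for the mechanism
```

Ignoring the 22 unrecorded values, as a missing-at-random analysis would, understates the rate because the lost observations are precisely the large ones. The sampler corrects for this by imputing them from the part of the distribution that the missingness mechanism implies.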
Finally, the work done in this thesis is summarized, and a preliminary outlook is given on directions for follow-up research on models for count data.

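[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: C81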

[Similar literature]