Penalized-Likelihood Variable Selection Methods and Their Application to High-Dimensional Data

Published: 2018-09-03 18:13

[Abstract]: With the rapid development of information technology, both the volume of available data and the number of variables keep growing. How to choose the best among many candidate models has therefore become an important research topic in econometrics. A good variable selection method overcomes the heavy computation and overfitting of traditional methods: the selected model predicts accurately, excludes nuisance variables effectively, and is as parsimonious as possible. Because penalized likelihood is a continuous optimization procedure, it is more stable than traditional discrete methods, and with a suitable algorithm it remains computationally feasible even when the number of variables is large. For high-dimensional data models, model selection by penalized likelihood is therefore more effective, accurate, and stable. Based on the penalized likelihood approach, this thesis studies variable selection for several classes of high-dimensional data models. The resulting methods perform model selection and parameter estimation simultaneously. Using probability theory and mathematical statistics, the thesis proves that the estimators have the Oracle property: they select the correct model with probability tending to 1, and they are asymptotically normally distributed. The methods studied and the main conclusions are as follows.

First, the thesis proposes an adaptive bridge estimation method for high-dimensional data models. Inspired by the bridge estimator, it assigns different weights to the penalty terms according to the importance of each variable, and examines whether the adaptive bridge estimator meets the standard of a good estimator, that is, whether it has the Oracle property. The thesis proves that, under suitable conditions, the adaptive bridge estimator does have the Oracle property. Its numerical and empirical performance is evaluated with Monte Carlo simulation and real data.

Second, the thesis studies M-estimation for high-dimensional linear regression models and discusses the properties of the estimator when the penalty term is handled by local linear approximation. M-estimation is a general framework that covers least absolute deviation estimation, quantile regression, least squares estimation, and Huber regression. When the data contain outliers or the error term follows a heavy-tailed distribution, least absolute deviation regression, a special case of M-estimation, is more robust than least squares. The thesis proves theoretically that, under certain conditions, the estimator obtained by combining an M-estimation loss with a locally linearly approximated penalty in the objective function has good large-sample properties. In the numerical simulations, a suitable algorithm is implemented and shows that the method is more robust. For ultra-high-dimensional data models, simulations also show that combining backward regression with the proposed method performs better. In the empirical part, real data show that the proposed method selects variables and estimates parameters well.

Finally, the thesis studies the identification of credit default customers with a Logistic model in the high-dimensional setting. The Logistic model, which is widely used in credit scoring, is chosen to identify the factors that influence credit default, and the fitted model is then used to measure and predict the default risk of credit customers. The numerical simulation results show that the proposed variable selection method is effective. The empirical results also show that the variable selection methods for high-dimensional data models proposed in this thesis can select models with strong explanatory and predictive power.
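To illustrate how a bridge-type penalty can be handled by local linear approximation, here is a minimal Python sketch. It is not the thesis's algorithm: the abstract does not state the objective function, the weights, or the tuning procedure. The sketch assumes the objective (1/2n)·||y − Xβ||² + λ·Σ|β_j|^q with 0 < q < 1, an ordinary least squares initial estimate (which requires n > p), and centred data with no intercept. The function names `soft_threshold`, `weighted_lasso_cd`, and `one_step_lla_bridge` and the smoothing constant `eps` are hypothetical. Linearizing |β_j|^q around the initial estimate turns the problem into a weighted lasso, so variables with small initial estimates receive larger penalties.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def weighted_lasso_cd(X, y, lam, weights, n_iter=500, tol=1e-8):
    """Coordinate descent for
        (1 / 2n) * ||y - X b||^2 + lam * sum_j weights[j] * |b_j|.
    Assumes centred data (no intercept) and no all-zero columns."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # x_j' x_j / n for each column
    resid = y - X @ beta                   # current residual vector
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # Correlation of x_j with the partial residual (x_j's own fit added back).
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            if new != old:
                resid -= X[:, j] * (new - old)
                max_change = max(max_change, abs(new - old))
                beta[j] = new
        if max_change < tol:
            break
    return beta


def one_step_lla_bridge(X, y, lam, q=0.5, eps=1e-6):
    """One-step local linear approximation of the bridge penalty |b|^q.
    Linearizing around an initial estimate b0 gives the weights
    q * |b0_j|^(q - 1); eps (an assumption) guards against division by zero."""
    beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS start, needs n > p
    weights = q * (np.abs(beta_init) + eps) ** (q - 1.0)
    return weighted_lasso_cd(X, y, lam, weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 400, 10
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[[0, 1, 4]] = [3.0, 1.5, 2.0]
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    print(np.round(one_step_lla_bridge(X, y, lam=0.3), 3))
```

The least squares starting point is chosen only for brevity. When p exceeds n it is unavailable, and some other initial estimator (for example a ridge or marginal regression fit) would have to be substituted. A robust variant in the spirit of the M-estimation part would replace the squared loss with a least absolute deviation or Huber loss, which this sketch does not attempt.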
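[Degree-granting institution]: University of International Business and Economics
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: F224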

[Similar literature]