A Comparison of Lasso-Based Statistical Inference Methods for High-Dimensional Linear Regression Models
Published: 2018-07-27 15:23
[Abstract]: Objective: This thesis introduces five Lasso-based methods for statistical inference in high-dimensional linear regression models: the Lasso penalized score test (Lassoscore), multiple sample-splitting (MS-split), stability selection, the low-dimensional projection estimate (LDPE), and the covariance test (Covtest). It compares the five methods and analyzes how each performs under different high-dimensional data scenarios.

Methods: The basic principles of the Lasso penalized score test, multiple sample-splitting, stability selection, low-dimensional projection, and the covariance test are described in turn. Simulated data were generated by varying four parameters: seven sample sizes, n = 50, 75, 100, 150, 200, 300, 400; two numbers of predictors, p = 100 and p = 300; two correlation structures among the predictors, either mutually independent or corr(Xi, Xj) = 0.5^|i-j|; and two regression coefficient magnitudes, either β1 = β2 = β3 = β4 = β5 = 5 with βj = 0 for j > 5, or β1 = β2 = β3 = β4 = β5 = 0.15 with βj = 0 for j > 5. Combinations of these four parameters define the different high-dimensional data scenarios. The data were simulated in R and analyzed with all five methods, and the methods were compared using the expected false positive rate (Expected False Positives, EFP) and power as evaluation criteria.

Results: Under ideal high-dimensional data, all of the methods performed well except the covariance test, whose inference was conservative. Stability selection had the lowest EFP and the highest power, making it the best of the five. Low-dimensional projection, stability selection, and multiple sample-splitting all rely on the βmin condition. Stability selection depends on it too heavily, so its power drops sharply under complex high-dimensional data and it performs poorly there. Under complex high-dimensional data, low-dimensional projection is conservative at both large and small sample sizes; its power is high at moderate sample sizes, but only at the cost of a very high number of false positives. The covariance test is conservative in every scenario. Under complex high-dimensional data, the Lasso penalized score test has the highest power of the five methods, followed by multiple sample-splitting; however, the Lasso penalized score test also has the highest EFP, whereas the EFP of multiple sample-splitting is close to 0.

Conclusion: Under the complex high-dimensional data commonly met in practice, the Lasso penalized score test is better than the other four methods at detecting truly nonzero variables and makes only a weak demand on βmin, but its expected false positive rate is high. The ability of multiple sample-splitting to detect truly nonzero variables depends on whether the data satisfy the βmin condition, yet even when the condition fails it is second only to the Lasso penalized score test, and its expected false positive rate is very low. For common complex high-dimensional data, the Lasso penalized score test and multiple sample-splitting are therefore the two better inference methods for high-dimensional linear regression, the former being relatively liberal and the latter relatively conservative. Although in practice it is impossible to know whether real data satisfy the βmin condition, a suitable inference method can be chosen according to the needs of the application.
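The simulation design described in the Methods part of the abstract can be sketched roughly as follows. This is only an illustration in Python using the standard library; the thesis reports using R and its code is not shown here. The standard normal predictors, the noise standard deviation of 1, the function names, and the exact definitions of EFP (average number of falsely selected variables per replicate) and power (average fraction of the five truly nonzero variables detected) are all assumptions, not details taken from the thesis.

```python
import math
import random


def simulate_design(n, p, rho, rng):
    """Draw an n x p Gaussian design with corr(X_i, X_j) = rho ** |i - j|.

    rho = 0 gives mutually independent predictors.
    """
    scale = math.sqrt(1.0 - rho * rho)
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            # AR(1) recursion keeps unit variance and yields rho ** |i - j|.
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(row)
    return rows


def simulate_response(X, beta, rng, sigma=1.0):
    """y = X beta + N(0, sigma^2) noise; sigma = 1 is an assumption."""
    return [
        sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma)
        for row in X
    ]


def efp_and_power(selected_sets, true_support):
    """Assumed definitions: mean false positives and mean share of true hits."""
    reps = len(selected_sets)
    efp = sum(len(s - true_support) for s in selected_sets) / reps
    hits = sum(len(s & true_support) for s in selected_sets)
    power = hits / (reps * len(true_support))
    return efp, power


rng = random.Random(2015)
# One "complex" scenario from the abstract: correlated predictors, weak signal.
n, p, rho, signal = 100, 300, 0.5, 0.15
beta = [signal] * 5 + [0.0] * (p - 5)
X = simulate_design(n, p, rho, rng)
y = simulate_response(X, beta, rng)

true_support = set(range(5))
# Hypothetical selections from two replicates, only to show the metric arithmetic.
efp, power = efp_and_power([{0, 1, 2, 7}, {0, 1, 2, 3, 4}], true_support)
print(efp, power)  # 0.5 0.8
```

In a full comparison, each of the five inference methods would supply the selected sets for many replicates of each scenario; the two hard-coded sets above merely stand in for that output.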
[Degree-granting institution]: Shanxi Medical University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: R195.1
Article ID: 2148260
Article link: https://www.wllwen.com/kejilunwen/yysx/2148260.html