Bayesian Variable Selection Based on Non-local Priors and Its Application to Ultra-high-dimensional Data Analysis
Published: 2018-08-01 14:06
[Abstract]:
Objective: To compare, through simulation studies, the performance of a Bayesian variable selection method based on non-local priors with that of ISIS-SCAD and ISIS-MCP in ultra-high-dimensional data analysis, and to apply these methods to diffuse large B-cell lymphoma (DLBCL) gene expression data in order to identify genes associated with DLBCL subtyping and to provide a basis for the clinical diagnosis and treatment of DLBCL.
Methods: The basic principles of the product inverse moment (piMOM) prior, a Bayesian variable selection method based on non-local priors, are introduced, and piMOM, ISIS-SCAD, and ISIS-MCP are applied to binary logistic regression. In the simulation study, the correlation among covariates is divided into three cases according to the covariance structure: mutually independent, compound symmetric, and autoregressive. The sample sizes are n = 50, 100, 200, 400, and 600, and the numbers of predictors are p = 1000 and 3000. The three variable selection methods are evaluated under these ultra-high-dimensional settings in terms of model selection consistency and predictive accuracy. In the real-data analysis, the DLBCL data, comprising 350 patients and 3237 genes, are split into a training set (n = 245) and a test set (n = 105); models are fitted and validated with piMOM, ISIS-SCAD, and ISIS-MCP, and the three models are compared by AUC.
Results: In the simulation study, for both p = 1000 and p = 3000, the mean number of true positives among the selected variables was roughly equal across the three methods, whereas the mean number of false positives, the prediction mean squared error, and the mean squared error of the regression coefficients were markedly higher for ISIS-SCAD and ISIS-MCP than for the non-local prior method. The non-local prior method also fluctuated less as the dimension increased and was more stable than ISIS-SCAD and ISIS-MCP. On the DLBCL gene expression data, piMOM identified 4 relevant genes (MYBL1, CYB5R2, MAML3, BTLA) with an AUC of 0.989; ISIS-SCAD identified 7 relevant genes (MYBL1, CYB5R2, MAML3, TNFRSF13B, S1PR2, SLC25A27, GAB1) with an AUC of 0.981; ISIS-MCP identified 5 relevant genes (MYBL1, CYB5R2, MAML3, CHST2, SUB1) with an AUC of 0.962. The genes selected by all three methods were MYBL1, CYB5R2, and MAML3.
Conclusion: The Bayesian variable selection method based on non-local priors outperforms the traditional penalized methods in model selection and predictive accuracy, and to some extent controls the false positive rate better. MYBL1, BTLA, CYB5R2, and MAML3 may be associated with DLBCL subtyping.
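The simulation design and evaluation criteria described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the correlation strength `rho = 0.5`, the random seeds, and the sparse coefficient dictionary are all hypothetical choices, since the abstract does not report them. The sketch draws covariates under the three correlation structures, generates binary outcomes from a logistic model, counts true and false positives for a selected variable set, and computes AUC from the Mann-Whitney statistic.

```python
import math
import random


def simulate_covariates(n, p, structure, rho=0.5, seed=0):
    """Draw n rows of p unit-variance normal covariates (rho is an assumed value)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        noise = [rng.gauss(0.0, 1.0) for _ in range(p)]
        if structure == "independent":
            row = noise
        elif structure == "compound_symmetric":
            # A shared factor gives every pair of covariates correlation rho.
            shared = rng.gauss(0.0, 1.0)
            row = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * e for e in noise]
        elif structure == "autoregressive":
            # AR(1) recursion gives corr(x_j, x_k) = rho ** |j - k|.
            row = [noise[0]]
            for e in noise[1:]:
                row.append(rho * row[-1] + math.sqrt(1.0 - rho ** 2) * e)
        else:
            raise ValueError(f"unknown structure: {structure}")
        rows.append(row)
    return rows


def simulate_outcomes(X, beta, seed=1):
    """Draw binary outcomes from a logistic model; beta maps column index -> coefficient."""
    rng = random.Random(seed)
    y = []
    for row in X:
        eta = sum(b * row[j] for j, b in beta.items())
        prob = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < prob else 0)
    return y


def selection_counts(selected, true_support):
    """Return (true positives, false positives) for a selected set of variable indices."""
    selected, true_support = set(selected), set(true_support)
    return len(selected & true_support), len(selected - true_support)


def auc(scores, labels):
    """AUC as the Mann-Whitney statistic; tied score pairs count one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))
```

For one replicate one might call `simulate_covariates(100, 1000, "autoregressive")`, pass the result to `simulate_outcomes` with a hypothetical sparse `beta` such as `{0: 1.5, 1: -1.5, 2: 1.0}`, run a variable selection method of choice, and then summarise the selected indices with `selection_counts` and the test-set predictions with `auc`. Averaging these summaries over many replicates yields the mean true-positive and false-positive counts of the kind the abstract reports.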
Article ID: 2157819
Article link: https://www.wllwen.com/yixuelunwen/zlx/2157819.html