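A Comparative Study of Comprehensive Projection Pursuit Regression Analysis and Comprehensive Traditional Regression Analysis for Complex Data
Published: 2018-04-25 15:42
Topics: projection pursuit + comprehensive traditional regression analysis; Source: master's thesis, Academy of Military Medical Sciences of the Chinese People's Liberation Army, 2017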
【Abstract】: Statistical analysis of high-dimensional data is increasingly common in medical research. High dimensionality creates problems for traditional multivariate statistical methods: the computational burden is heavy, the curse of dimensionality appears, and methods that are robust in low dimensions lose robustness in high dimensions. Traditional methods fall well short of what high-dimensional data analysis requires, and multivariate methods built on the assumption of normality are especially ineffective when the data are not normally distributed. Against this background, projection pursuit emerged in the 1960s and 1970s. To analyze high-dimensional data, projection pursuit projects the data onto a low-dimensional space (1 to 3 dimensions) that reflects the structure or features of the original data, and uses a projection index to measure how much information the projected distribution contains. The key to projection pursuit is therefore finding the projection direction at which the projection index reaches its maximum or minimum; at present, genetic algorithms are most often used to search for the optimal direction. Combining projection pursuit with regression analysis yields projection pursuit regression.
This study analyzes the same complex data sets with both projection pursuit regression and traditional regression, then compares their fitting and prediction performance to determine which approach suits such data better. The study makes the applicability of projection pursuit regression more concrete and may raise awareness of the method among analysts of medical statistical data, which should support a sounder choice of method in future regression analyses of complex data.
The projection pursuit regression methods used here comprise the methods available in R (the three methods in the PPR package: Spline, Gcvspline and Supsmu) and the method used in self-developed projection pursuit regression software (the Hermite polynomial method). "Comprehensive traditional regression analysis" refers mainly to multiple linear regression, principal component regression, ridge regression, partial least squares (PLS) regression and robust regression. "Complex data" is defined in this study as covering two situations. In the first, the independent variables are multicollinear; the traditional methods applied are principal component regression, ridge regression and PLS regression, computed with the REG, PRINCOMP and PLS procedures in SAS. In the second, the data contain outliers; the traditional method applied is robust regression, computed with the ROBUSTREG procedure in SAS. Beyond these two situations, the study also compares projection pursuit regression with traditional multiple linear regression on good-quality data, that is, data without multicollinearity or outliers for which multiple linear regression already fits and predicts well.
Fitting performance is evaluated mainly by the coefficient of determination and the mean of the absolute relative errors; prediction performance is evaluated mainly by the absolute relative error of each prediction sample and the mean squared prediction error (MSE). For the real data sets, the fitting samples are the original sample data, and the prediction samples are the six points formed from the mean, maximum, minimum, median, first quartile and third quartile of the corresponding variables.
For good-quality real data, projection pursuit regression fitted and predicted better than multiple linear regression, although the difference was small. With projection pursuit regression, the coefficient of determination ranged from 0.9703 to 0.9988, the mean relative error from 0.0039 to 0.0187, and the prediction MSE from 12.91 to 16.77; with multiple linear regression, the coefficient of determination was 0.9639, the mean relative error 0.0224, and the prediction MSE 18.80. For good-quality simulated data, the two methods differed very little in fitting and prediction and could not be ranked; both achieved coefficients of determination above 0.9942.
Three real data sets with collinearity among the independent variables were analyzed. For the first, the traditional methods (principal component regression, ridge regression and PLS regression) gave coefficients of determination from 0.9351 to 0.9386 and mean relative errors from 0.0497 to 0.0528, with prediction MSEs of 1.18 (principal component regression), 0.66 (ridge regression) and 1.14 (PLS regression); projection pursuit regression gave coefficients of determination from 0.9756 to 0.9846, mean relative errors from 0.0316 to 0.0363, and prediction MSEs from 0.69 to 0.86. For the second, the traditional methods gave coefficients of determination from 0.9039 to 0.9820 and mean relative errors from 0.0174 to 0.0383, with prediction MSEs of 126.59 (principal component regression), 208.40 (ridge regression) and 215.82 (PLS regression); projection pursuit regression gave coefficients of determination from 0.9823 to 0.9927, mean relative errors from 0.0104 to 0.0175, and prediction MSEs from 11.00 to 27.25. For the third, the traditional methods gave coefficients of determination from 0.8023 to 0.8924 and mean relative errors from 0.0450 to 0.0642, with prediction MSEs of 0.61 (principal component regression), 0.36 (ridge regression) and 0.23 (PLS regression); projection pursuit regression gave coefficients of determination from 0.8851 to 0.9980, mean relative errors from 0.0046 to 0.0481, and prediction MSEs from 0.03 to 0.65.
Two real data sets containing outliers were analyzed. For the first, both projection pursuit regression and robust regression fitted the data poorly: the highest coefficient of determination among the traditional methods was 0.3641, and projection pursuit regression gave coefficients of determination from 0.1857 to 0.6650. For the second, M-regression gave a coefficient of determination of 0.8982, a mean relative error of 0.1377 and a prediction MSE of 3.3919; projection pursuit regression gave coefficients of determination from 0.9423 to 0.9563, mean relative errors from 0.0899 to 0.1138, and prediction MSEs from 2.3604 to 3.0308.
The results support the following conclusions. (1) For good-quality data, multiple linear regression and projection pursuit regression fit almost equally well, with coefficients of determination above 0.95, and projection pursuit regression is computationally harder; multiple linear regression is therefore recommended in this case. (2) When the data are collinear, projection pursuit regression can be considered better than the traditional remedies for collinearity (principal component regression, ridge regression and PLS regression). (3) When the data contain outliers, projection pursuit regression is tentatively considered to perform better than robust regression. (4) The quality of the data itself is very important. Research design deserves close attention, in particular identifying accurately and completely the independent variables that affect the outcome variable, and obtaining a sample that is large enough and sufficiently representative of the population. If important causal variables are overlooked or omitted during data collection, later statistical analysis can hardly compensate.
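The evaluation measures named in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's own code: the abstract does not give formulas, so the coefficient of determination is taken as 1 - SS_res/SS_tot, the relative error as (observed - fitted)/observed, and the quartiles follow the "inclusive" convention of Python's `statistics.quantiles`. All function names are hypothetical.

```python
from statistics import mean, median, quantiles


def r_squared(observed, fitted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    y_bar = mean(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


def mean_abs_relative_error(observed, fitted):
    """Mean of |observed - fitted| / |observed| (observed values must be nonzero)."""
    return mean(abs((y - f) / y) for y, f in zip(observed, fitted))


def prediction_mse(observed, predicted):
    """Mean squared prediction error over the prediction samples."""
    return mean((y - p) ** 2 for y, p in zip(observed, predicted))


def six_summary_points(values):
    """The six statistics used to form prediction samples for one variable.

    The quartile convention is an assumption; the abstract does not specify one.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "max": max(values),
        "min": min(values),
        "median": median(values),
        "q1": q1,
        "q3": q3,
    }


if __name__ == "__main__":
    # Made-up numbers purely to show the calls; they are not data from the thesis.
    observed = [2.0, 4.0, 6.0, 8.0]
    fitted = [2.5, 3.5, 6.5, 7.5]
    print(r_squared(observed, fitted))
    print(mean_abs_relative_error(observed, fitted))
    print(prediction_mse(observed, fitted))
    print(six_summary_points([1, 2, 3, 4, 5]))
```

Applying `six_summary_points` to each variable in turn gives the six prediction samples described in the abstract; the fitted model's predictions at those points can then be scored with `prediction_mse`.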
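【Degree-granting institution】: Academy of Military Medical Sciences of the Chinese People's Liberation Army
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: O212.1; R195.1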
Article ID: 1801955
Article link: https://www.wllwen.com/kejilunwen/yysx/1801955.html