A Simulation-Based Comparison of Different Missing-Value Handling Techniques
Topic: missing values  Approach: simulation techniques  Source: Zhengzhou University, master's thesis, 2012  Document type: degree thesis
[Abstract]: Objective. Missing data are common in research on traditional Chinese medicine (TCM) syndromes of AIDS. They complicate analysis and can bias results, so finding an imputation method suited to this database is an urgent task before any analysis can proceed. Based on field-survey data on TCM syndromes, this study uses data simulation to compare the strengths and weaknesses of different handling methods, examine where each is applicable, determine the optimal number of imputations for multiple imputation (MI), and identify the most accurate, efficient, and convenient method under different missing-data patterns and mechanisms.
Methods. Complete data sets and data sets with different missing rates were simulated in SAS 9.1. For continuous variables missing completely at random (MCAR) or missing at random (MAR), the missing values were handled by expectation maximization (EM), regression imputation, mean imputation, listwise deletion, and multiple imputation (MI), and the methods were compared on precision, accuracy, and the resulting mean. For binary variables, listwise deletion and logistic regression imputation within MI were used, and the methods were compared on regression coefficients and standard errors.
Results. 1. Continuous variables: all data in this study followed an arbitrary missing pattern. Imputation efficiency rose as the number of imputations increased and exceeded 0.95 in every case at 10 imputations. Precision likewise rose with the number of imputations and was highest at 10. As for accuracy, at missing rates of 20% or below a small number of imputations (3-5) was enough to reach high accuracy; at missing rates of 30-40%, MI with 10 imputations gave relatively high accuracy; at 50% or above, accuracy was unstable.
2. MCAR mechanism: at missing rates of 10% or below, every method gave a mean consistent with that of the complete data set, and MI had the highest precision and accuracy. At 20% or above, listwise deletion and MI outperformed the other methods, with MI the more precise and listwise deletion the more accurate.
3. MAR mechanism: at low missing rates (10%-20%), MI was more accurate and more precise than the other methods. At 30%, listwise deletion gave high accuracy but poor precision. At a high missing rate (40%), no method imputed well.
4. Binary variables: at low missing rates (below 40%), listwise deletion was simple, accurate, and efficient, whereas MI required a more complex program and considerable memory and time for repeated imputation, and still gave worse results than listwise deletion. At 40%-50%, MI with logistic regression achieved good results with only a few imputations (2). At 60% or above, neither method performed well.
Conclusion. For large-sample continuous data, which can be treated as normally distributed, the tolerable proportion of missing values is below 30%. Traditional methods such as mean imputation and listwise deletion are simple and convenient and have certain advantages, but MI addresses a broader range of common problems, has more room to show its strengths, makes it practical to impute most types of missing values, and achieves high imputation efficiency.
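The comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's SAS 9.1 program: the Python code below, the normal(50, 10) data, the sample size of 5000, the seed, and all function names are assumptions chosen for illustration. It simulates MCAR missingness at several rates and contrasts listwise deletion with mean imputation. It also assumes that the reported "imputation efficiency" corresponds to Rubin's relative efficiency, 1 / (1 + lambda / m), where lambda is the fraction of missing information and m is the number of imputations.

```python
import random
import statistics


def simulate_mcar(values, missing_rate, rng):
    """Blank out each value independently with probability `missing_rate` (MCAR)."""
    return [None if rng.random() < missing_rate else v for v in values]


def listwise_deletion(values):
    """Keep only the observed values."""
    return [v for v in values if v is not None]


def mean_imputation(values):
    """Replace every missing value with the mean of the observed values."""
    observed_mean = statistics.fmean(listwise_deletion(values))
    return [observed_mean if v is None else v for v in values]


def relative_efficiency(fraction_missing_info, n_imputations):
    """Rubin's relative efficiency of m imputations: 1 / (1 + lambda / m)."""
    return 1.0 / (1.0 + fraction_missing_info / n_imputations)


rng = random.Random(2012)  # fixed seed so the run is repeatable
# Hypothetical complete data set; the thesis data are not reproduced here.
complete = [rng.gauss(50.0, 10.0) for _ in range(5000)]

summary = {}
for rate in (0.1, 0.3, 0.5):
    incomplete = simulate_mcar(complete, rate, rng)
    deleted = listwise_deletion(incomplete)
    imputed = mean_imputation(incomplete)
    summary[rate] = {
        "mean_deleted": statistics.fmean(deleted),
        "mean_imputed": statistics.fmean(imputed),
        "sd_deleted": statistics.stdev(deleted),
        "sd_imputed": statistics.stdev(imputed),
    }

for rate, row in summary.items():
    print(rate, {name: round(value, 3) for name, value in row.items()})

# Relative efficiency for lambda = 0.5 at 3, 5, and 10 imputations.
for m in (3, 5, 10):
    print(m, round(relative_efficiency(0.5, m), 3))
```

Under these assumptions, mean imputation leaves the mean unchanged but shrinks the standard deviation, and the shrinkage grows with the missing rate. With lambda = 0.5, the relative efficiency is about 0.857, 0.909, and 0.952 for 3, 5, and 10 imputations, which agrees with an efficiency above 0.95 at 10 imputations whenever the fraction of missing information does not exceed one half.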
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: R181.3