A Simulation-Based Comparison of Different Missing-Value Handling Techniques

Published: 2018-03-02 18:02

Topic: missing values　Focus: simulation techniques　Source: Zhengzhou University, master's thesis, 2012　Document type: degree thesis

[Abstract]: Objective. Missing data are common in research on traditional Chinese medicine (TCM) syndromes in AIDS. Missing data complicate the analysis and can bias the results, so finding an imputation method suited to this database is an urgent step before any data analysis. Based on field survey data on TCM syndromes, this study uses data simulation to compare the strengths and weaknesses of different handling methods, discuss where each one applies, determine the optimal number of imputations for multiple imputation (MI), and identify the most accurate, efficient, and convenient method under different missing-data patterns and mechanisms.

Methods. Complete data sets and data sets with different missing rates were simulated in SAS 9.1. For continuous variables missing completely at random (MCAR) or missing at random (MAR), missing values were handled by expectation maximization (EM), regression imputation, mean imputation, listwise deletion, and MI, and the methods were compared on precision, accuracy, and the mean. For binary variables, listwise deletion and logistic regression imputation within MI were used, and the methods were compared on regression coefficients and standard errors.

Results.
1. Continuous variables: all data in this study had an arbitrary missing pattern. Imputation efficiency rose as the number of imputations increased and exceeded 0.95 in every case at 10 imputations. Precision also rose with the number of imputations and was highest after 10. As for accuracy, with 20% or less missing, a small number of imputations (3-5) was enough for high accuracy; at missing rates of 30-40%, MI with 10 imputations was comparatively accurate; with 50% or more missing, accuracy was unstable.
2. MCAR mechanism: with up to 10% missing, every method gave a mean consistent with that of the complete data set, and MI had the highest precision and accuracy. With 20% or more missing, listwise deletion and MI outperformed the other methods; MI had the higher precision and listwise deletion the higher accuracy.
3. MAR mechanism: with little missing data (10%-20%), MI was more accurate and more precise than the other methods. At 30% missing, listwise deletion was accurate but imprecise. With more missing data (a missing rate of 40%), no method imputed well.
4. Binary variables: with less missing data (a missing rate below 40%), listwise deletion was simple, accurate, and efficient, whereas MI was more complex to program, needed considerable memory and time for repeated imputation, and still performed worse than listwise deletion. At 40%-50% missing, MI with logistic regression achieved good results with only a few imputations (2). At missing rates of 60% or more, neither method performed well.

Conclusion. Large-sample continuous data can be treated as normally distributed, and the tolerable proportion of missing values is below 30%. Traditional methods such as mean imputation and listwise deletion are simple and convenient and have certain advantages, but MI handles a broader range of common problems, has more room to show its strengths, can impute most types of missing values, and has high imputation efficiency.
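The comparison described in the abstract can be illustrated with a small simulation. The sketch below is not the thesis's SAS 9.1 program: the sample size, the linear model for y, the logistic missingness model, and all function names are assumptions made for illustration, and EM and MI are left out. It generates complete data, deletes values under MCAR and MAR, and compares listwise deletion, mean imputation, and regression imputation. The `relative_efficiency` helper uses Rubin's formula (1 + γ/m)⁻¹, which is assumed here to be what the abstract calls imputation efficiency.

```python
import math
import random
import statistics


def simulate_complete(n, seed=0):
    """Draw n (x, y) pairs; y depends linearly on x plus normal noise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        pairs.append((x, 2.0 + x + rng.gauss(0.0, 1.0)))
    return pairs


def make_mcar(pairs, rate, seed=1):
    """MCAR: each y is deleted with the same probability, whatever x is."""
    rng = random.Random(seed)
    return [(x, None if rng.random() < rate else y) for x, y in pairs]


def make_mar(pairs, rate, seed=2):
    """MAR: the deletion probability of y depends only on the observed x."""
    rng = random.Random(seed)
    offset = math.log(rate / (1.0 - rate))  # logit of the nominal rate
    out = []
    for x, y in pairs:
        # Assumed logistic model: larger x makes y more likely to be missing.
        p_missing = 1.0 / (1.0 + math.exp(-(offset + x)))
        out.append((x, None if rng.random() < p_missing else y))
    return out


def listwise_mean(pairs):
    """Listwise deletion: drop incomplete cases, then average y."""
    return statistics.fmean(y for _, y in pairs if y is not None)


def mean_imputed(pairs):
    """Mean imputation: replace every missing y with the observed mean."""
    fill = listwise_mean(pairs)
    return [fill if y is None else y for _, y in pairs]


def regression_imputed(pairs):
    """Regression imputation: predict missing y from x via complete-case OLS."""
    obs = [(x, y) for x, y in pairs if y is not None]
    mx = statistics.fmean(x for x, _ in obs)
    my = statistics.fmean(y for _, y in obs)
    slope = (sum((x - mx) * (y - my) for x, y in obs)
             / sum((x - mx) ** 2 for x, _ in obs))
    intercept = my - slope * mx
    return [intercept + slope * x if y is None else y for x, y in pairs]


def relative_efficiency(gamma, m):
    """Rubin's relative efficiency of m imputations: (1 + gamma / m) ** -1."""
    return 1.0 / (1.0 + gamma / m)


complete = simulate_complete(5000)
true_mean = statistics.fmean(y for _, y in complete)
for label, data in (("MCAR", make_mcar(complete, 0.3)),
                    ("MAR", make_mar(complete, 0.3))):
    print(label,
          "listwise bias:", round(listwise_mean(data) - true_mean, 3),
          "regression bias:",
          round(statistics.fmean(regression_imputed(data)) - true_mean, 3))
```

In this toy setup, mean imputation reproduces the complete-case mean but shrinks the spread of y, and listwise deletion is biased under MAR because the deleted cases differ systematically from the retained ones. With a fraction of missing information of 0.5, `relative_efficiency(0.5, 10)` is about 0.952, in line with the efficiency above 0.95 reported for 10 imputations.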
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: R181.3

[References]