Nearest-Neighbor Optimal Imputation Based on Logistic Regression

Published: 2018-01-31 17:52

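Keywords: nonresponse; nearest-neighbor imputation; Logistic regression imputation. Source: Tianjin University of Finance and Economics, 2013 master's thesis. Document type: degree thesis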

[Abstract]: Nonresponse is frequently encountered when collecting data in practice. Respondents may refuse or forget to answer a survey item, and lost files or incorrectly recorded data also produce nonresponse. Nonresponse in survey data complicates statistical analysis, can introduce substantial bias into the results, and lowers the quality of statistical work. No collection method yields perfectly "accurate" data, and in many cases time and cost constraints rule out repeating the survey. Preventing nonresponse beforehand is the most effective remedy, but because of practical constraints it often cannot eliminate the problem entirely. Imputation after the fact has therefore received growing attention, and many scholars have studied it in depth. This thesis briefly summarizes the imputation methods studied previously and, building on them, tries another one: nearest-neighbor optimal imputation based on Logistic regression. The method inherits the high accuracy of Logistic regression imputation and the unit-level donor selection of nearest-neighbor imputation. The thesis compares it by simulation with the commonly used mean imputation, nearest-neighbor imputation, regression imputation, and Logistic regression imputation, for nonresponse rates of 5%, 10%, 20%, 30%, 40%, and 50% and for 2, 3, 4, and 5 regression variables.
The simulation results show that, for categorical data, both nearest-neighbor optimal imputation based on Logistic regression and Logistic regression imputation outperform nearest-neighbor imputation, and in some cases the proposed method also outperforms Logistic regression imputation. For continuous data, when the variance is large (for example 0.25 or 1), the proposed method is clearly better than the other methods; when the variance is small (for example 0.01 or 0.04), its advantage is less pronounced, and its mean squared error tends to rise as the number of variables increases. For real data, the mean squared error tends to increase with the missing rate, and the proposed method has the smallest mean squared error and the smallest fluctuation, so it imputes well. The simulated and real data together indicate that nearest-neighbor optimal imputation based on Logistic regression has certain advantages, and the thesis hopes it offers a new method of reference value for practical problems.
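The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of the combined method for a binary item: fit a Logistic regression on the respondents, then give each nonrespondent the observed value of the respondent whose fitted probability is closest. The function names (`fit_logistic`, `logistic_nn_impute`), the matching rule on fitted probabilities, the gradient-descent settings, and the simulated data are all assumptions made for illustration, not the thesis's own implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the intercept, w[1:] are the slope coefficients.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a Logistic regression by batch gradient descent (assumed settings)."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def logistic_nn_impute(X, y):
    """Fill None entries of a binary y with the value of the nearest donor.

    "Nearest" is measured on the fitted response probability, which is an
    assumption of this sketch rather than a detail given in the abstract.
    """
    observed = [i for i, v in enumerate(y) if v is not None]
    w = fit_logistic([X[i] for i in observed], [y[i] for i in observed])
    p = [predict_proba(w, x) for x in X]
    filled = list(y)
    for i, v in enumerate(y):
        if v is None:
            donor = min(observed, key=lambda j: abs(p[j] - p[i]))
            filled[i] = y[donor]
    return filled


# Simulated example: 2 covariates, 20% nonresponse (one of the rates studied).
random.seed(0)
n = 200
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y_true = [1 if random.random() < sigmoid(0.5 + 1.5 * a - 1.0 * b) else 0
          for a, b in X]
missing = set(random.sample(range(n), n // 5))
y_obs = [None if i in missing else v for i, v in enumerate(y_true)]

y_imp = logistic_nn_impute(X, y_obs)
error_rate = sum(y_imp[i] != y_true[i] for i in missing) / len(missing)
print(f"imputation error rate at 20% nonresponse: {error_rate:.3f}")
```

Repeating the simulated part over the other nonresponse rates and numbers of covariates, and computing the same error measure for mean or plain nearest-neighbor imputation, would give a comparison in the spirit of the one the abstract reports. The sketch covers only the binary case, not the continuous data the thesis also examines.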
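[Degree-granting institution]: Tianjin University of Finance and Economics
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: O212.1; C81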

[References]