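Nearest-Neighbor Preferential Imputation Based on Logistic Regression
Keywords: nonresponse; nearest-neighbor imputation; logistic regression imputation. Source: Tianjin University of Finance and Economics, master's thesis, 2013. Document type: degree thesis.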
[Abstract]: Nonresponse is common in real-world data collection. Respondents may refuse or forget to answer a survey question, and lost files or incorrectly recorded data also produce nonresponse. Nonresponse in survey data makes statistical analysis harder, can introduce substantial bias into the results, and lowers the quality of statistical work. No data collection method yields perfectly "exact" data, and in many cases time and cost constraints rule out repeating the survey. Prevention before the fact is the most effective remedy, but practical constraints mean it often cannot eliminate nonresponse entirely. Post hoc imputation of nonresponse has therefore received growing attention, and many scholars have studied it in depth.
This thesis briefly summarizes the imputation methods studied previously and, building on them, tries a further method: nearest-neighbor preferential imputation based on logistic regression. The method inherits the high accuracy of logistic regression imputation and the unit-selection property of nearest-neighbor imputation. The thesis compares it by simulation with the commonly used mean imputation, nearest-neighbor imputation, regression imputation, and logistic regression imputation, for nonresponse rates of 5%, 10%, 20%, 30%, 40%, and 50% and for 2, 3, 4, and 5 regression variables. The simulation results show that, for categorical data, both the proposed method and logistic regression imputation outperform nearest-neighbor imputation, and in some cases the proposed method also outperforms logistic regression imputation. For continuous data, when the variance is large (for example, 0.25 or 1), the proposed method is clearly better than the other methods; when the variance is small (for example, 0.01 or 0.04), its advantage is less pronounced, and its mean squared error tends to rise as the number of variables increases. For real data, the mean squared error tends to increase with the missing rate, and the proposed method has the smallest mean squared error and the smallest volatility, giving good imputation results.
The simulated and real data together indicate that nearest-neighbor preferential imputation based on logistic regression has certain advantages, and it is hoped that it offers a new and useful reference method for practical problems.
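The abstract does not spell out the algorithm, so the following is only a minimal sketch of how such a comparison on binary data might look. It assumes that the hybrid method fits a logistic regression on the respondents and then copies the observed value of the respondent whose fitted probability is closest to that of the nonrespondent; this donor rule, the missing-completely-at-random mechanism, and all function names (`impute_logistic_nn`, `simulate`, and so on) are illustrative assumptions, not the thesis's definitions. For 0/1 data the mean squared error of the imputed values equals the misclassification rate.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    """P(y = 1 | x) under weights w, where w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by plain batch gradient descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def _donors(y, missing):
    """Indices of respondents, i.e. units whose y is observed."""
    return [i for i in range(len(y)) if i not in missing]

def impute_nearest_neighbor(X, y, missing):
    """Copy y from the respondent closest in covariate space (Euclidean)."""
    donors = _donors(y, missing)
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return {m: y[min(donors, key=lambda d: dist(X[m], X[d]))] for m in missing}

def impute_logistic(X, y, missing):
    """Impute the class predicted by a logistic model fit on respondents."""
    donors = _donors(y, missing)
    w = fit_logistic([X[d] for d in donors], [y[d] for d in donors])
    return {m: int(predict_proba(w, X[m]) >= 0.5) for m in missing}

def impute_logistic_nn(X, y, missing):
    """ASSUMED hybrid rule: donor = respondent with the closest fitted probability."""
    donors = _donors(y, missing)
    w = fit_logistic([X[d] for d in donors], [y[d] for d in donors])
    p_hat = [predict_proba(w, x) for x in X]
    return {m: y[min(donors, key=lambda d: abs(p_hat[m] - p_hat[d]))]
            for m in missing}

def simulate(rate=0.2, n=200, p=2, seed=0):
    """Mean squared imputation error per method for one simulated data set."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    # Assumed true model: intercept 0 and all coefficients equal to 1.
    y = [int(rng.random() < sigmoid(sum(x))) for x in X]
    # Assumed mechanism: values go missing completely at random.
    missing = set(rng.sample(range(n), round(rate * n)))
    methods = {
        "nearest_neighbor": impute_nearest_neighbor,
        "logistic": impute_logistic,
        "logistic_nn": impute_logistic_nn,
    }
    results = {}
    for name, method in methods.items():
        imputed = method(X, y, missing)  # methods only read y at donor indices
        results[name] = sum((imputed[m] - y[m]) ** 2 for m in missing) / len(missing)
    return results

if __name__ == "__main__":
    for rate in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
        print(rate, simulate(rate=rate))
```

A single simulated data set per rate is noisy; a comparison in the spirit of the abstract would average the errors over many seeds and repeat the run for `p` in 2 to 5.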
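[Degree-granting institution]: Tianjin University of Finance and Economics
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: O212.1; C81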