An Approximate Imputation Method for Incomplete Data in Machine Learning

Published: 2018-02-02 16:32

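Keywords: machine learning; missing entries; quadratic programming; imputation method  Source: Journal of Xi'an Jiaotong University, 2017, Issue 10  Paper type: journal article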

[Abstract]: Data with missing entries cannot be used effectively in machine learning, which lowers classification and regression accuracy. To address this problem, an approximate imputation method, the k-ANNO method, is proposed. Given an incomplete data sample, the method first uses a graph structure built offline to approximately search for the k nearest-neighbor vertices of that sample, then estimates the optimal weight of each neighbor with fast quadratic programming, and finally fills in the sample's missing entries based on those weights. Users can trade off imputation efficiency against accuracy according to their actual needs. The k-ANNO method handles the data incompleteness that is common in machine learning and effectively suppresses its interference with classification and regression accuracy. The imputation quality of k-ANNO was evaluated on several public datasets. The results show that, when the speedup ratio is between 2 and 10, the classification error rate of k-ANNO is 1% to 4% lower than that of the existing mean imputation, C-means imputation, and self-organizing map imputation methods, and its regression root mean square error is about 0.5 to 2.0 lower than that of the existing methods. When the sample size is 4 000, the computational efficiency of k-ANNO is about 35% to 320% higher than that of the naive k-nearest-neighbor method, depending on the speedup ratio parameter.
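The three steps in the abstract (neighbor search, quadratic-programming weights, weighted fill-in) can be illustrated with a minimal sketch. It is not the authors' implementation, and several things are assumptions: the neighbor search is an exact brute-force scan rather than the paper's offline graph-based approximate search; the quadratic program is assumed to be a reconstruction problem (minimize the squared error on the observed coordinates, with non-negative weights that sum to 1), because the abstract does not state the objective; and the solver is plain projected gradient descent rather than the paper's "fast" solver. The function names `knn_observed`, `project_to_simplex`, `qp_weights`, and `impute` are hypothetical, and missing entries are encoded as `None`.

```python
import math


def knn_observed(sample, data, k):
    """Brute-force k nearest complete rows, comparing only observed coordinates.

    Assumption: exact search stands in for the graph-based approximate search.
    """
    obs = [j for j, v in enumerate(sample) if v is not None]

    def dist(row):
        return math.sqrt(sum((row[j] - sample[j]) ** 2 for j in obs))

    return sorted(data, key=dist)[:k], obs


def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold of the largest valid prefix
    return [max(x - theta, 0.0) for x in v]


def qp_weights(sample, neighbors, obs, steps=500):
    """Assumed QP: minimize w^T G w over the simplex, where G is the Gram
    matrix of (neighbor - sample) restricted to the observed coordinates."""
    k = len(neighbors)
    diffs = [[n[j] - sample[j] for j in obs] for n in neighbors]
    gram = [[sum(a * b for a, b in zip(di, dj)) for dj in diffs] for di in diffs]
    trace = sum(gram[i][i] for i in range(k))
    if trace == 0:
        return [1.0 / k] * k  # all neighbors match exactly on observed entries
    step = 1.0 / (2.0 * trace)  # 2 * trace bounds the gradient's Lipschitz constant
    w = [1.0 / k] * k
    for _ in range(steps):
        grad = [2.0 * sum(gram[i][j] * w[j] for j in range(k)) for i in range(k)]
        w = project_to_simplex([wi - step * gi for wi, gi in zip(w, grad)])
    return w


def impute(sample, data, k=3):
    """Fill each missing entry with the weighted average of the neighbors."""
    neighbors, obs = knn_observed(sample, data, k)
    w = qp_weights(sample, neighbors, obs)
    return [
        v if v is not None else sum(wi * n[j] for wi, n in zip(w, neighbors))
        for j, v in enumerate(sample)
    ]


# Toy data (made up for illustration): the third column is 10x the first.
complete_rows = [[0.0, 0.0, 0.0], [1.0, 1.0, 10.0], [2.0, 2.0, 20.0], [10.0, 10.0, 100.0]]
filled = impute([0.25, 0.25, None], complete_rows, k=2)
```

In this sketch the efficiency versus accuracy trade-off mentioned in the abstract would enter through the neighbor search: replacing `knn_observed` with an approximate graph search makes the lookup cheaper at the cost of occasionally returning neighbors that are not the true nearest ones.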
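[Author affiliation]: Key Laboratory of Blind Signal Processing
[Funding]: National Natural Science Foundation of China (U1536105)
[Classification number]: TP181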
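[Body snapshot]: Machine learning is an effective way to mine latent patterns in data and can predict the unknown class or value of the object under study, so it is widely used in computer vision, smart homes [1], questionnaire analysis [2], genome analysis [3], and other fields. When the input data of a machine learning method contain missing entries, the prediction accuracy of many methods drops sharply, leading to missed detections, false alarms, and even model…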
