Research on Machine Learning Methods That Exploit Unlabeled Data

Published: 2018-04-23 10:26

Topic: machine learning + semi-supervised learning; Reference: Nanjing University, 2017 master's thesis

[Abstract]: Machine learning needs labeled data to train predictive models, but labels usually require human annotation and are therefore expensive to obtain. In many practical applications, unlabeled data can be collected easily and in large quantities, so how to exploit cheap unlabeled data has long been an active research topic in machine learning. Two approaches have emerged. The first is semi-supervised learning, which automatically uses unlabeled data alongside labeled data to improve learning performance. Most such methods do help, but they all rest on underlying model assumptions and may degrade performance when those assumptions deviate from the data distribution. The second is crowdsourcing, which supplies labels at a relatively low cost so that the data can then be used precisely, reducing learning risk. This thesis studies both semi-supervised learning and crowdsourcing and makes the following contributions.

First, co-training, an important paradigm in semi-supervised learning, is easily affected by insufficient views. To address this, a new weighted co-training algorithm is proposed. When the views are insufficient, samples that are inconsistent with the optimal classifier arise during co-training. The algorithm detects potentially inconsistent samples and lowers their weights to reduce their influence on training. Experimental results show that it generalizes better and is more robust than the standard co-training algorithm.

Second, because the labeling of a task in crowdsourcing depends on the task's difficulty, a new task assignment algorithm is proposed. The algorithm estimates the difficulty of a subset of tasks, uses them as a training set to learn a difficulty-prediction model, and divides tasks into two classes: easy and hard. Easy tasks are labeled through crowdsourcing, while for hard tasks experts are hired to provide high-quality labels. Experimental results show that the algorithm improves label quality while reducing labeling cost.

In addition, this thesis studies model reuse with unlabeled data. In this scenario, the user needs to combine several pretrained models that cannot be modified. For this problem, a new multi-view model reuse algorithm is proposed. The algorithm estimates the reliability of each pretrained model through belief propagation, uses multi-view consistency on unlabeled data to guide that estimation, and then combines the pretrained models in an ensemble weighted by the estimated reliabilities. Experimental results show that the method significantly improves classification accuracy.
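The abstract does not give the details of the model reuse algorithm, so the following is only a rough sketch of the general idea of reliability-weighted ensembling over fixed models. It deliberately swaps belief propagation for a much simpler iterative agreement heuristic and ignores the multi-view aspect entirely. All function names (`weighted_vote`, `estimate_reliability`, `ensemble_predict`) and the toy data are hypothetical and do not come from the thesis.

```python
from collections import defaultdict


def weighted_vote(labels, weights):
    """Return the label with the largest total weight among the models' votes."""
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    # Ties go to the label that was voted for first (dict insertion order).
    return max(scores, key=scores.get)


def ensemble_predict(model_outputs, weights):
    """Combine per-model labels sample by sample using a weighted vote.

    model_outputs[k][i] is the label that fixed model k assigns to sample i.
    """
    n_samples = len(model_outputs[0])
    return [
        weighted_vote([outputs[i] for outputs in model_outputs], weights)
        for i in range(n_samples)
    ]


def estimate_reliability(model_outputs, n_iters=10):
    """Assumed heuristic (NOT belief propagation): a model's reliability is
    its agreement rate with the current weighted consensus on unlabeled data.
    """
    n_samples = len(model_outputs[0])
    weights = [1.0] * len(model_outputs)  # start from an unweighted vote
    for _ in range(n_iters):
        consensus = ensemble_predict(model_outputs, weights)
        weights = [
            sum(pred == ref for pred, ref in zip(outputs, consensus)) / n_samples
            for outputs in model_outputs
        ]
    return weights


# Toy unlabeled set: models 0 and 1 mostly agree, model 2 mostly disagrees.
outputs = [
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
]
weights = estimate_reliability(outputs)
combined = ensemble_predict(outputs, weights)
```

No labels are used anywhere in this sketch: the weights come purely from how the fixed models agree with one another on unlabeled samples, which is the part of the setting the abstract describes. A model that disagrees with the consensus ends up with a small weight and contributes little to the final vote.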
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP181
