Research on Machine Learning Algorithms for Different-Source Data

Published: 2018-01-22 11:08

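Keywords: machine learning; different-source data; homogeneous different-source data; heterogeneous different-source data; crowd learning; transfer learning. Source: University of Science and Technology of China, 2017 master's thesis. Document type: degree thesis.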

[Abstract]: Traditional machine learning rests on the basic assumption that data are same-source, that is, that training data and test data follow the same distribution. In real-world settings, however, same-source data are scarce, and a limited amount of such data cannot train an effective machine learning model; this is the same-source data scarcity problem. One remedy is to construct same-source data manually, but this is too costly. Another effective remedy is to integrate different-source data, whose distributions differ, into model training, which makes machine learning algorithms for different-source data very important. Depending on whether the sample spaces are the same, different-source data can be divided into homogeneous different-source data (same sample space) and heterogeneous different-source data (different sample spaces). To address the scarcity problem, labels for unlabeled samples can be collected through crowdsourcing. If each annotator taking part in crowdsourcing is regarded as a data source, the collected data are homogeneous different-source data. Machine learning algorithms for such data are called crowd learning (learning-from-crowds) algorithms. According to the steps by which the target classifier is obtained, crowd learning algorithms fall into two-stage methods and direct methods. The personal classifier method is a representative direct method; it has a convex objective function but makes a strong assumption about the distribution of the model parameters. This thesis proposes a nonparametric crowd learning algorithm that constructs a convex objective function by combining optimization objectives and makes no assumption about the distribution of the model parameters. Another way to integrate different-source data is to use data from other domains to help train the model in the target domain. Data from different domains differ in both sample space and distribution, and are therefore heterogeneous different-source data. Machine learning algorithms for such heterogeneous different-source data are called transfer learning. According to what is transferred, transfer learning methods fall into three categories: those based on sample weights, those based on feature representations, and those based on model parameters. This thesis studies and proposes a model-based transfer method and a method that transfers models and samples jointly. Both methods can use data from the auxiliary domain to improve model performance in the target domain.
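The abstract contrasts two-stage and direct crowd learning methods without showing either. As a rough illustration only, the sketch below shows a generic two-stage baseline: first estimate one label per sample from the annotators' votes (here by simple majority vote), then train an ordinary classifier on the estimated labels. It is not the nonparametric algorithm the thesis proposes, nor the personal classifier method. The function names, the toy data, the tie-breaking rule, and the choice of logistic regression are all assumptions made for this example.

```python
import math
from collections import Counter


def majority_vote(crowd_labels):
    """Stage 1: reduce each sample's crowd labels to one estimated label.

    crowd_labels is a list of dicts mapping annotator id -> label in {0, 1}.
    Ties go to label 1 (an arbitrary choice for this sketch).
    """
    aggregated = []
    for votes in crowd_labels:
        counts = Counter(votes.values())
        aggregated.append(1 if counts[1] >= counts[0] else 0)
    return aggregated


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Stage 2: fit logistic regression by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical toy data: one feature per sample, three annotators (a, b, c).
# Annotator "c" is deliberately noisy on the first and third samples.
features = [[-1.5], [-0.5], [0.5], [1.5]]
crowd_labels = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 1, "c": 1},
]

estimated = majority_vote(crowd_labels)        # stage 1
w, b = train_logistic(features, estimated)     # stage 2
predictions = [predict(w, b, x) for x in features]
```

A direct method, by contrast, would skip the separate label-estimation stage and learn the target classifier from all annotators' labels in a single optimization problem.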
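[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181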

[References]