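Research on Instance-Based Domain Adaptation and Incremental Learning Methods
Published: 2018-04-03 20:37
Topic: text classification. Focus: instance transfer. Source: Nanjing University of Science and Technology, master's thesis, 2017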
[Abstract]: With the rapid development of Internet technology, the amount of information available online grows daily. This explosive growth has both benefits and drawbacks, and how to use the information efficiently and fully has become an urgent problem for academia and industry. Text classification is a commonly used technique for this problem; by learning setting, it can be divided into domain-specific and domain-adaptive text classification. Many domain adaptation algorithms based on instance transfer already exist, but they share a common phenomenon: overfitting caused by over-learning of the instance weights. To the author's knowledge, no prior work has explicitly discussed this problem, and this thesis studies it systematically. In addition, in natural language processing, traditional statistical machine learning models are usually single-task, that is, the model is learned from the training data in one pass. This limits the generalization and scalability of the algorithms, and the thesis makes incremental improvements to address that drawback.

First, the thesis introduces ILA, a representative instance-based domain adaptation algorithm, and proposes regularization methods on top of it to strengthen the effect of transfer learning. The regularization methods comprise six sub-methods: three based on early stopping, two that add a penalty factor as a regularization term in the ILA model, and one that introduces Dropout Training into instance-weight learning. Text classification experiments show that all of these methods improve the performance of the instance transfer algorithm to some degree, with Dropout Training giving the most notable gain.

Second, the thesis systematically studies the overfitting of weight learning in domain adaptation. Although the regularization methods above alleviate overfitting indirectly, they do not solve the underlying problem, and they severely limit the efficiency and adaptability of the algorithm. The thesis therefore proposes a loss-function penalty method, which penalizes the loss function to different degrees according to the instance weights. Experiments show that this method not only markedly reduces overfitting but is also adaptable and efficient; the variant that penalizes the loss on the small number of samples with large weights performs best and most stably.

Finally, the thesis proposes an incremental naive Bayes model based on lifelong learning, which adds an incremental parameter update scheme and a lifelong learning mechanism to the traditional naive Bayes model. The model can store knowledge learned from large-scale historical tasks and use it to assist the learning of new tasks that have only a small number of labeled samples. It updates its parameters incrementally: each round of learning only updates the historical model, without retraining on the historical data. Experiments on text classification show that the model not only uses knowledge from past tasks incrementally to guide the learning of new tasks, but also handles new features and adapts across domains well.
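The incremental update described at the end of the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's model in detail, so this is not the author's method: it is a plain multinomial naive Bayes whose parameters are stored as counts, so a new batch is folded in by adding to those counts. The class name `IncrementalNaiveBayes`, the methods `update` and `predict`, and the Laplace smoothing constant `alpha` are all assumptions made for illustration, and the lifelong learning mechanism itself is not reproduced.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Hypothetical sketch: multinomial naive Bayes stored as raw counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.doc_counts = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()                       # grows when new tokens appear

    def update(self, docs, labels):
        """Fold a new batch into the stored counts; earlier data is not revisited."""
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the class with the highest smoothed log-posterior."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in any batch are skipped
                    score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = IncrementalNaiveBayes()
# Earlier task.
model.update([["great", "plot"], ["dull", "plot"]], ["pos", "neg"])
# Later task with new features; only the counts are updated.
model.update([["great", "battery"], ["dull", "screen"]], ["pos", "neg"])
print(model.predict(["great", "screen"]))
```

Because the vocabulary size is read at prediction time, tokens first seen in a later batch enter the smoothing denominator without any retraining on earlier batches.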
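[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP181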