A Transfer Learning Algorithm Based on Bipartite Graphs

Published: 2018-05-12 19:00

Topics: text classification + transfer learning; Source: Guangdong University of Foreign Studies, 2017 master's thesis

[Abstract]: Text is a common form of data on the Internet. However, the rapid growth of the Internet has produced large volumes of redundant data, placing a heavy burden on data producers, managers, and consumers alike. To address this, researchers have proposed machine-learning-based automatic text classification methods for managing online text data, reducing the labor cost wasted on redundant data. Internet text is highly time-sensitive, however, and old and new texts differ greatly in domain, so existing labeled text and newly generated text are not independent and identically distributed in the feature space; that is, a model trained directly on the existing labeled data cannot be applied to classifying newly generated data. To solve this problem, transfer learning offers a knowledge-transfer approach that lets different but related domains or tasks reuse existing knowledge. Even so, current transfer learning algorithms still have limitations, such as poor interpretability and low efficiency. Against this background, after surveying the key techniques commonly used in automatic text classification and transfer learning, this thesis proposes a transfer learning algorithm based on a bipartite graph. The main idea is as follows. First, feature extraction and feature selection are performed on the text data, and the documents and features of the source and target domains are combined to build a document-feature bipartite graph. Next, based on this graph, the transitive relationship between any two features in the joint domain is computed, and these relationships serve as the bridge for knowledge transfer, mapping the feature space of the target-domain documents into the feature space of the source domain. Then, a classical machine learning classifier is trained on the labeled source-domain text. Finally, the source-domain model is used to automatically classify the target-domain documents. Parameter experiments, classifier experiments, comparison experiments, and interpretability experiments show that the proposed algorithm effectively addresses the interpretability and efficiency problems in transfer learning.
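The four steps in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption rather than the thesis's actual method: the feature-to-feature relationship is modelled as a two-step random walk (feature -> document -> feature) on the bipartite graph, a nearest-centroid classifier with cosine similarity stands in for the "classical machine learning classifier", whitespace tokenization replaces real feature extraction and selection, and all function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_bipartite(docs):
    """Build both sides of the document-feature graph from raw text."""
    doc_feats = [Counter(d.split()) for d in docs]   # document -> feature weights
    feat_docs = defaultdict(dict)                     # feature -> document weights
    for i, feats in enumerate(doc_feats):
        for f, w in feats.items():
            feat_docs[f][i] = w
    return doc_feats, feat_docs


def feature_transition(doc_feats, feat_docs):
    """Assumed relation: two-step random walk feature -> document -> feature."""
    trans = {}
    for f, docs in feat_docs.items():
        total_f = sum(docs.values())
        row = defaultdict(float)
        for i, w in docs.items():
            total_d = sum(doc_feats[i].values())
            for g, v in doc_feats[i].items():
                row[g] += (w / total_f) * (v / total_d)
        trans[f] = dict(row)
    return trans


def map_to_source(doc, trans, source_vocab):
    """Project a document onto source-domain features via the transition weights."""
    mapped = defaultdict(float)
    for f, w in doc.items():
        for g, p in trans.get(f, {}).items():
            if g in source_vocab:
                mapped[g] += w * p
    return dict(mapped)


def cosine(a, b):
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(vectors, labels):
    """Stand-in classifier: one summed feature vector per class."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for f, w in vec.items():
            sums[label][f] += w
    return {label: dict(v) for label, v in sums.items()}


def predict(vec, centroids):
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy data (invented for illustration): labeled source domain, unlabeled target domain.
source_docs = ["film actor plot", "film actor scene",
               "team goal match", "team goal score"]
source_labels = ["movies", "movies", "sports", "sports"]
target_docs = ["director cinema film", "director cinema sequel",
               "league striker goal", "league striker transfer"]

# Step 1: joint document-feature bipartite graph over both domains.
doc_feats, feat_docs = build_bipartite(source_docs + target_docs)
# Step 2: feature-to-feature relationships, then map target docs to source features.
trans = feature_transition(doc_feats, feat_docs)
source_vocab = {f for d in doc_feats[:len(source_docs)] for f in d}
mapped_targets = [map_to_source(d, trans, source_vocab)
                  for d in doc_feats[len(source_docs):]]
# Step 3: train on labeled source documents only.
centroids = train_centroids(doc_feats[:len(source_docs)], source_labels)
# Step 4: classify the mapped target documents with the source model.
predictions = [predict(v, centroids) for v in mapped_targets]
print(predictions)  # ['movies', 'movies', 'sports', 'sports']
```

In this toy run, "director cinema sequel" shares no word with the source vocabulary, yet it still reaches the source feature "film" because "director" and "cinema" co-occur with "film" in another target document. The entries of `trans` can be inspected directly, which hints at how a graph-based mapping of this kind might support the interpretability the abstract emphasizes.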
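[Degree-granting institution]: Guangdong University of Foreign Studies
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181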
