Research on Multi-Label Learning Algorithms Based on Label Dependencies

Published: 2018-06-05 14:24

Topics: data mining + classification learning; Source: Beijing Jiaotong University, 2016 doctoral dissertation

[Abstract]: Multi-label classification is an important research problem in machine learning and data mining; its goal is to predict the multiple labels that an instance carries simultaneously. In most practical applications, latent dependencies exist among an instance's labels, and exploiting the information they contain can often effectively improve the learning performance of a classification model. How to learn and exploit label dependencies has therefore become one of the key problems in multi-label classification. This dissertation first summarizes the state of research and analyzes the strengths and weaknesses of existing methods. It then explores several ways of learning and exploiting label dependencies of different types and in different application scenarios, and proposes several more effective multi-label classification models and algorithms. The main results are as follows.
(1) Models such as classifier chains often decide at random which other labels each label depends on, and may therefore produce results that do not match reality. To address this, the dissertation proposes representing label dependencies with a tree-structured Bayesian network. The method explicitly measures the degree of dependency between labels and builds a network with labels as nodes and dependency degrees as edge weights, so that the dependencies among labels can be determined in a principled way. Ensemble learning is further used to construct multiple plausible dependency structures, so that the mutual dependencies among labels are considered more fully. Experimental results confirm the effectiveness of the algorithm, indicating that measuring the degree of dependency between labels and fully accounting for their mutual dependencies can further improve the performance of the classification model.
(2) A multi-label learning algorithm is proposed that represents the degree of dependency between labels with a graph and models the iterative propagation of label dependencies as a random walk on that graph. The method first builds a graph over the labels and uses a random walk with restart model to simulate the iterative propagation of dependencies through the graph. For a given test instance, the method first produces the initial probability that each label is a true label of the instance, then iteratively updates each label's value in a PageRank-like manner until convergence. This repeated updating lets each label take into account not only the influence of the labels it depends on directly, but also indirect dependencies. Experimental results show that the algorithm clearly outperforms the compared algorithms under several evaluation metrics, especially when the dataset has many labels. This indicates that considering the iterative propagation of label dependencies makes it possible to discover and exploit the latent information they contain more effectively.
(3) Building on the previous method, a multi-label learning algorithm is further proposed that can take multiple latent factors into account and learns the optimal degree of dependency among labels by optimizing a given objective function. Drawing on the idea of multiple kernel learning, the method first produces several measurements of label dependency from different perspectives, based on different definitions of dependency, and then feeds these measurements into a linear model to learn the final degree of dependency between labels. The method has two advantages. First, it combines dependency measurements taken from different perspectives. Second, it estimates the parameters of the linear model by minimizing the loss function used by the classification model, and can therefore learn the dependency degrees that are optimal for the current classification task. Experimental results show that, compared with the previous method and other baselines, the dependencies and dependency degrees learned by optimizing the objective function clearly improve the performance of the corresponding classification model.
(4) For the problems of weak labels and large numbers of labels, the dissertation proposes a method, based on a matrix factorization model, for learning an optimal label ranking. The method maps the original label space to a low-dimensional space, which significantly reduces the number of labels and hence the amount of computation. For each instance in the training set, two label sets are available: the labels that are explicitly given, and the remaining labels that are not. Most existing methods assume that a label not explicitly given is an irrelevant label of the instance (that is, anything not marked 1 is treated as 0). To avoid the erroneous information this assumption may introduce, the proposed method assumes only that, for each instance, the explicitly given labels are more likely to be relevant than the labels not explicitly given. Accordingly, the method designs an AUC-like loss function and optimizes it so that, in the label ranking predicted for an instance, the explicitly given labels are placed ahead of the labels not explicitly given as far as possible. The method can therefore make full use of label dependencies to produce a more reasonable label ranking even when weak labels are present. Experimental results confirm that the method performs better on particular datasets.
Taken together, these results approach the problem from the angle of exploiting different types of label dependency, propose corresponding learning methods and models, and verify their effectiveness experimentally, laying a good foundation for practical application and further research.
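
As a rough illustration of the random walk with restart described in contribution (2), the following Python sketch propagates one test instance's initial label probabilities over a label graph until the scores stop changing. The weight matrix `W`, the function name `rwr_label_scores`, the restart probability, the column normalisation, and the L1 convergence test are all assumptions made for illustration; the abstract does not say how the dissertation builds the graph or chooses these values.

```python
import numpy as np


def rwr_label_scores(W, p0, restart=0.3, tol=1e-8, max_iter=1000):
    """Random walk with restart over a label graph (illustrative sketch).

    W  : (q, q) non-negative label-dependency weights (hypothetical input).
    p0 : (q,) initial per-label probabilities for one test instance.
    """
    W = np.asarray(W, dtype=float)
    p0 = np.asarray(p0, dtype=float)

    # Column-normalise W so each column is a transition distribution.
    # Labels with no outgoing weight keep an all-zero column.
    col_sums = W.sum(axis=0)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    T = W / safe_sums

    p = p0.copy()
    for _ in range(max_iter):
        # Follow an edge with probability (1 - restart),
        # or jump back to the initial scores with probability restart.
        p_next = (1.0 - restart) * (T @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy graph with three labels: label 1 is linked to labels 0 and 2.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
p0 = np.array([0.9, 0.1, 0.1])  # assumed output of some base classifier
scores = rwr_label_scores(W, p0, restart=0.3)
print(scores)
```

Because every update mixes propagated scores with the initial ones, a label's final score reflects both its direct neighbours and labels reachable only through intermediate nodes, which is the indirect-dependency effect the abstract describes. A larger `restart` keeps the result closer to the initial probabilities.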
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Doctorate
[Year conferred]: 2016
[Classification number]: TP181

[Similar literature]