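Research on Key Problems in Multi-Label Learning
Topics: multi-label learning + multi-label classification; Source: Xidian University (西安电子科技大学), doctoral dissertation, 2016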
[Abstract]: With the development of technology, a growing number of applications involve multi-label problems, such as text classification, image annotation, and gene function analysis. Unlike traditional single-label (binary or multi-class) problems, a multi-label problem allows one instance to be associated with several labels at once, and richer relationships exist among the labels, which makes multi-label problems more complex to analyze. Multi-label learning studies how to assign all appropriate class labels to an unseen instance. Because of these label relationships, multi-label learning is considerably more complex than traditional single-label learning and harder to analyze. Driven by application demand, more and more researchers have taken up multi-label learning, and it has become one of the active research topics in machine learning and pattern recognition. Although much progress has been made, several key challenges remain: the classification performance of existing multi-label classification algorithms still needs improvement, a high-dimensional label space leads to high training and testing time costs, and a high-dimensional feature space easily leads to overfitting. Multi-label classification, label space dimension reduction, and multi-label dimensionality reduction are therefore three major research directions in multi-label learning. Research on multi-label classification algorithms aims to improve classification performance. Label space dimension reduction algorithms exploit label relationships by reducing the dimensionality of the label space, with the aim of improving classification performance while reducing training and testing time. Multi-label dimensionality reduction addresses the "curse of dimensionality" in multi-label learning by reducing the dimensionality of the feature space to obtain a better instance representation. This dissertation studies multi-label learning along these three directions. The main contributions are as follows:
1. Because labels often exhibit clustered relationships, a multi-label classification algorithm based on clustered intrinsic label relationships is proposed. In this algorithm, the weight vector of each label consists of a common component and a label-specific component. The common component is shared by all labels and corresponds to background information in the instances; the specific component belongs to a single label and corresponds to the information in the instances that is unique to that label. The intrinsic relationships among labels are reflected in the relationships among the specific components, and these relationships are often clustered. The proposed method extends the support vector machine on the basis of this weight vector structure and improves classification performance by imposing a cluster-relationship regularizer on the specific components of all labels. By relaxing the orthogonality constraint, the non-convex problem is turned into a jointly convex semidefinite program, and an optimization method based on block coordinate descent with alternating iterative update rules is given. Experimental results show that the classification performance of the proposed algorithm is clearly better than that of related multi-label classification algorithms.
2. To address the fact that existing multi-label classification algorithms train all labels on the same instance representation, a multi-label classification algorithm is proposed that uses the distribution of instances to construct, for each label, a new and more discriminative instance representation. Since a single instance representation cannot reflect the characteristics of every label well, the proposed algorithm uses the one-vs-all strategy to transform the multi-label classification problem into multiple binary classification subproblems, one per label. In each subproblem, the associations between the local structures of positive and negative instances play an important role in building an effective classification model. To mine these associations, a new spectral clustering method, spectral instance calibration (谱示例校准), is proposed. The proposed multi-label classification algorithm uses the clustering results obtained by spectral instance calibration to build, for each label, an instance representation that better matches the label's characteristics, and then trains a binary classification model on the new representation. Experimental results confirm the effectiveness of the algorithm.
3. To make full use of instance information during label space dimension reduction, a label space dimension reduction algorithm based on dependence maximization is proposed. The objective function has two parts: encoding loss and dependence loss. The encoding loss measures the information lost when the label matrix is compressed with principal component analysis. After the label vectors are reduced to codeword vectors, a regression model from the feature space to the codeword space must still be learned, so the relationship between instances and codeword vectors matters; the dependence loss measures the loss in the dependence between the two. To exploit instance information, the proposed algorithm is the first to use the Hilbert-Schmidt independence criterion to measure the dependence loss, so that the dependence between instances and codeword vectors can be mined and exploited more fully. In addition, the effect of two different instance kernel matrices on the algorithm's performance is examined: one is based on global structure information, the other on local latent structure information. Experimental results show that the algorithm not only greatly shortens training and testing time but also effectively improves classification performance. The variant using the latter kernel matrix achieves better classification performance, while its training and testing time is comparable to that of the variant using the former.
4. To address outliers in instances and label vectors, a robust label space dimension reduction algorithm based on the l2,1 norm is proposed. Because of problems with data acquisition equipment, the instances in a data set often contain outliers; a label-vector outlier is a label vector that clearly departs from the main label relationships exploited by the label space dimension reduction algorithm. The objective function consists of an encoding loss and a dependence loss. The encoding loss measures the information lost when the label matrix is compressed with principal component analysis. The dependence loss measures the loss of the linear regression relationship between instances and codeword vectors. To handle outliers, both the encoding loss and the dependence loss in the objective use the l2,1 norm. The resulting problem is non-smooth; the modified alternating iterative update method proposed in the dissertation solves it effectively, and its convergence is analyzed. Experimental results show that the proposed robust label space dimension reduction both shortens training and testing time and improves classification performance. Moreover, experiments on data sets with contaminated labels show that the algorithm is more robust than other label space dimension reduction algorithms.
5. Existing multi-label dimensionality reduction methods do not exploit local latent structure, although research on traditional dimensionality reduction has shown such structure to be useful. A new multi-label dimensionality reduction method, multi-label local discriminant embedding, is therefore proposed. The method uses an asymmetric label relationship matrix that better matches reality, which both gives larger weights to instances carrying more information and overcomes the over-counting problem in multi-label learning. It builds two sets of adjacency graphs to analyze local latent structure, so as to better mine and exploit the intrinsic geometric structure of the data and to give the reduced representation better within-class compactness and between-class separability. Imposing orthogonality constraints on the resulting optimization problem yields a set of orthogonal projection vectors. Experimental results show that, compared with related multi-label dimensionality reduction methods, the method produces more reasonable reduction results and more discriminative features, and thus achieves better classification accuracy.
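The abstract names two quantities without giving their formulas: the Hilbert-Schmidt independence criterion (HSIC) used for the dependence loss, and the l2,1 norm used in the robust variant. The sketch below illustrates both under stated assumptions and is not the dissertation's implementation. It assumes the commonly used biased empirical HSIC estimator, tr(KHLH) / (n - 1)^2 with H = I - (1/n)11^T, and the row-wise convention for the l2,1 norm (the sum of the Euclidean norms of the rows). The linear kernel and all function names are illustrative choices; the abstract does not specify which instance kernels the dissertation uses.

```python
import math


def linear_kernel(rows):
    """Gram matrix K[i][j] = <rows[i], rows[j]> (assumed kernel, for illustration)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def center(K):
    """Double-center a square matrix: returns H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(K, L):
    """Biased empirical HSIC, tr(K H L H) / (n - 1)^2, for symmetric K and L.

    Uses tr(K H L H) = tr((H K H) L) = sum_ij (H K H)[i][j] * L[i][j],
    which holds because L is symmetric.
    """
    n = len(K)
    if n < 2:
        raise ValueError("HSIC needs at least two samples")
    Kc = center(K)
    total = sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2


def l21_norm(M):
    """Row-wise l2,1 norm: sum of the Euclidean norms of the rows of M."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in M)


if __name__ == "__main__":
    # Hypothetical toy data: four instances and their 1-D codewords.
    X = [[0.0], [1.0], [2.0], [3.0]]
    V = [[0.0], [2.0], [4.0], [6.0]]
    print(hsic(linear_kernel(X), linear_kernel(V)))
    print(l21_norm([[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]))
```

A larger HSIC value indicates stronger dependence between instances and codewords, so a dependence-maximization objective would favor codewords that raise it. With the row-wise convention, each row contributes its unsquared Euclidean norm, so a single outlying row affects the l2,1 loss less than it would affect a squared Frobenius loss.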
[Degree-granting institution]: Xidian University (西安电子科技大学)
[Degree level]: Doctorate
[Year degree granted]: 2016
[Classification number]: TP181
Article ID: 2073918
Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2073918.html