Research on Semi-Supervised Learning Classification Algorithms

Published: 2018-04-04 21:34

Topic: semi-supervised learning; Focus: data-driven; Source: Jiangsu University, master's thesis, 2017

[Abstract]: Machine learning has become an important way for computers to acquire knowledge and a hallmark of artificial intelligence. Traditional machine learning techniques require a large number of labeled samples for training. In many practical applications, however, labeled samples are difficult to obtain in quantity, whereas unlabeled samples are comparatively easy to collect. Semi-supervised learning methods, which need only a small number of labeled samples, have therefore attracted considerable attention in pattern recognition and machine learning. This thesis studies clustering and classification in semi-supervised learning. The main work is as follows.

Following the idea of co-training in semi-supervised learning theory, the thesis proposes a support vector machine (SVM) classification algorithm based on co-training. The algorithm uses two different SVM classifiers to extract information from the labeled samples, and each classifier then predicts the class labels of the unlabeled samples. A mutual-verification step selects the high-confidence predictions, which are added to the labeled set, and the classifiers are retrained on the expanded labeled set to achieve semi-supervised learning. The method simplifies the learning process while maintaining recognition accuracy. Experiments on UCI datasets, combined with the DAG-SVMs multi-class strategy, show that the algorithm achieves high classification accuracy when few labeled samples are available. Finally, the algorithm is applied to the classification of pupylation sites in prokaryotic proteins, with good results.

To address the problem that semi-supervised learning cannot effectively correct the learner when the initial labeled set is too small, the thesis proposes a self-training SVM classification algorithm based on cluster analysis. The algorithm first applies semi-supervised fuzzy c-means clustering to exploit information from the full sample set, then uses a self-training SVM to classify the samples. A second screening step within the algorithm reduces the probability of misclassification.

Considering the special properties of time series, and following the principle of structure learning, the thesis proposes a supervised reconstruction algorithm that performs dimensionality reduction and feature extraction on the original time series. Experiments on UCR datasets demonstrate the effectiveness of the algorithm, which is also applied to detecting edge effects in cytotoxicity assessment experiments on chemical substances, with good detection results.
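The co-training procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a binary problem (the thesis uses DAG-SVMs for the multi-class case), assumes the two SVMs differ by kernel, and assumes a mutual-verification rule in which both classifiers must predict the same class and place the point at least `margin` away from their decision boundaries. The function name `co_training_svm`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC


def co_training_svm(X_lab, y_lab, X_pool, margin=1.0, max_rounds=10):
    """Binary co-training sketch: two SVMs pseudo-label points they both trust."""
    # Two different SVMs; differing by kernel is an assumption of this sketch.
    clf_a = SVC(kernel="rbf")
    clf_b = SVC(kernel="linear")
    clf_a.fit(X_lab, y_lab)
    clf_b.fit(X_lab, y_lab)
    for _ in range(max_rounds):
        if len(X_pool) == 0:
            break
        score_a = clf_a.decision_function(X_pool)
        score_b = clf_b.decision_function(X_pool)
        # Mutual verification (assumed rule): both classifiers pick the same
        # side and both score the point at least `margin` in absolute value.
        agree = np.sign(score_a) == np.sign(score_b)
        confident = (np.abs(score_a) >= margin) & (np.abs(score_b) >= margin)
        keep = agree & confident
        if not keep.any():
            break
        # Move the accepted points, with their pseudo-labels, into the labeled set.
        X_lab = np.vstack([X_lab, X_pool[keep]])
        y_lab = np.concatenate([y_lab, clf_a.predict(X_pool[keep])])
        X_pool = X_pool[~keep]
        # Retrain both classifiers on the expanded labeled set.
        clf_a.fit(X_lab, y_lab)
        clf_b.fit(X_lab, y_lab)
    return clf_a, clf_b, X_lab, y_lab


# Toy data (an assumption): two well-separated blobs, 5 labeled points per class.
X, y = make_blobs(n_samples=200, centers=[[-3.0, -3.0], [3.0, 3.0]],
                  cluster_std=1.0, random_state=0)
lab_idx = np.concatenate([np.where(y == c)[0][:5] for c in (0, 1)])
is_lab = np.zeros(len(y), dtype=bool)
is_lab[lab_idx] = True

clf_a, clf_b, X_aug, y_aug = co_training_svm(X[is_lab], y[is_lab], X[~is_lab])
accuracy = float((clf_a.predict(X) == y).mean())
print(f"labeled set size: {is_lab.sum()} -> {len(y_aug)}; accuracy = {accuracy:.3f}")
```

Using the signed decision value as the confidence measure keeps the sketch deterministic and avoids fitting probability estimates on a handful of labeled points. A probability-based threshold would be an equally plausible reading of "high confidence".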
[Degree-granting institution]: Jiangsu University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181

[References]