Research on Key Learning Techniques for Computer-Aided Diagnosis in Medical Imaging
Published: 2018-09-17 16:42
【Abstract】: Computer-aided diagnosis (CAD), in which computer techniques assist radiologists in reading cases, plays an increasingly important role in early breast cancer screening and can help reduce the mortality of breast cancer patients. In clinical practice, labeled case samples are hard to collect and negative cases far outnumber positive cases, so CAD applications face learning problems on small-sample, imbalanced data sets. Imbalanced and small-sample learning concerns learning performance on data sets with severe class asymmetry and insufficiently represented information, and it matters in many real-world applications: although classical machine learning and data mining techniques have been highly successful in practice, learning from small-sample and imbalanced data remains a major challenge. This dissertation systematically analyzes the main reasons why machine learning performance degrades in small-sample and imbalanced settings and surveys current methods for addressing these problems. Building on the observation that common under-sampling methods tend to discard class information when handling imbalanced data, it focuses on how to process imbalanced data soundly and effectively. Two new under-sampling methods are proposed that retain the samples carrying the most class information, addressing the information loss caused by under-sampling. For the small-sample problem, a new class-labeling algorithm is proposed that enlarges the training set by automatically labeling unlabeled samples while reducing the labeling errors this process tends to introduce.
This dissertation centers on learning techniques for small-sample, imbalanced data, covering the resampling of imbalanced data sets and the class labeling of unlabeled samples. The main contributions are:
(1) To address the small-sample learning problem caused by the difficulty of collecting labeled case samples in CAD applications, the dissertation expands the training set with the abundant unlabeled samples. Labeling unlabeled samples, however, inevitably introduces some wrong class labels, and mislabeled samples act like noise and can markedly degrade learning performance. To counter mislabeling in semi-supervised learning, a Hybrid Class Labeling algorithm is proposed that assigns labels from three different perspectives: geometric distance, probability distribution, and semantic concept. Because the three labelers rest on different principles, they are markedly diverse; only unlabeled samples on which all three agree are added to the training set. To further limit the influence of any remaining mislabeled samples, a pseudo-label membership degree is introduced into SVM (Support Vector Machine) learning, so that the membership controls how much each sample contributes to training. Experiments on the UCI Breast-cancer data set show that the algorithm effectively alleviates the small-sample problem; compared with any single labeling technique it produces fewer mislabeled samples and achieves significantly better learning performance. A minimal sketch of this agreement-based pseudo-labeling with membership-weighted SVM training is given below.
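The following Python sketch illustrates the idea only, not the thesis's implementation. The three views are stood in for by a k-NN classifier (geometric distance), Gaussian naive Bayes (probability distribution), and a decision tree as a placeholder for the semantic-concept labeler, which the abstract does not specify; the constant `base_membership` is likewise a hypothetical choice.

```python
# Minimal sketch (not the author's exact method): pseudo-labeling by agreement of
# three heterogeneous labelers, followed by membership-weighted SVM training.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # geometric-distance view
from sklearn.naive_bayes import GaussianNB            # probability-distribution view
from sklearn.tree import DecisionTreeClassifier       # placeholder for the semantic view
from sklearn.svm import SVC

def hybrid_label_and_train(X_lab, y_lab, X_unlab, base_membership=0.5):
    views = [KNeighborsClassifier(n_neighbors=3),
             GaussianNB(),
             DecisionTreeClassifier(max_depth=3)]
    preds = []
    for clf in views:
        clf.fit(X_lab, y_lab)
        preds.append(clf.predict(X_unlab))
    preds = np.array(preds)                            # shape: (3, n_unlabeled)

    # Keep only the unlabeled samples on which all three labelers agree.
    agree = (preds[0] == preds[1]) & (preds[1] == preds[2])
    X_pseudo, y_pseudo = X_unlab[agree], preds[0][agree]

    # Membership degree: labeled samples contribute fully (1.0); pseudo-labeled
    # samples get a reduced weight so residual label noise is damped.
    X_train = np.vstack([X_lab, X_pseudo])
    y_train = np.concatenate([y_lab, y_pseudo])
    weights = np.concatenate([np.ones(len(y_lab)),
                              np.full(len(y_pseudo), base_membership)])

    svm = SVC(kernel="rbf", C=1.0)
    svm.fit(X_train, y_train, sample_weight=weights)   # weight acts as the membership degree
    return svm
```

Here `sample_weight` plays the role of the membership degree: a lower weight shrinks a pseudo-labeled sample's influence on the SVM loss, which is one simple way to realize the contribution control described above.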
(2) To address the loss of useful class information during common under-sampling, a new under-sampling method based on the convex hull (CH) structure is proposed. The convex hull of a data set is the smallest convex set containing all of its samples; every point lies inside the polygon or polytope spanned by the hull vertices. Inspired by this geometric property, the algorithm computes the convex hull of the majority class and replaces the majority training samples with the much smaller set of hull vertices, thereby balancing the training set. In practice the two classes often overlap, so their convex hulls overlap as well; using the raw hull to characterize the majority-class boundary then becomes problematic and easily causes over-fitting and reduced generalization. Since both the reduced convex hull (RCH) and the scaled convex hull (SCH) lose boundary information while shrinking the hull, a hierarchical reduced convex hull (HRCH) is proposed. Motivated by the marked structural difference and complementarity between RCH and SCH, the two are fused to generate the HRCH structure, which carries more diverse and complementary class information and loses less of it during hull reduction. By choosing different values of the reduction and scaling factors, the algorithm samples the majority class into several HRCH structures; each is combined with the rare-class samples to form a training set, a learner is trained on each, and the final classifier is obtained by ensemble learning. Compared with four reference algorithms, the method shows better classification performance and robustness. A simplified hull-based under-sampling sketch follows.
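The sketch below is a deliberate simplification, not the HRCH construction: instead of fusing reduced and scaled hulls, it peels successive convex-hull layers of the majority class ("onion peeling") until roughly as many majority samples as minority samples remain, which conveys the flavor of replacing a large class by boundary-carrying hull vertices. It assumes low-dimensional features (scipy's Qhull-based hull does not scale to high dimensions) and scikit-learn; function names are illustrative.

```python
# Minimal sketch: balance the classes by keeping the vertices of the outer
# convex-hull layers of the majority class, then train an SVM on the reduced set.
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.svm import SVC

def hull_peel_undersample(X_major, n_keep):
    """Collect hull vertices layer by layer until n_keep majority samples are kept."""
    remaining = X_major.copy()
    kept = []
    # Qhull needs more points than dimensions and non-degenerate data.
    while len(kept) < n_keep and len(remaining) > remaining.shape[1] + 1:
        hull = ConvexHull(remaining)
        kept.extend(remaining[hull.vertices])
        remaining = np.delete(remaining, hull.vertices, axis=0)
    return np.array(kept[:n_keep])

def train_balanced_svm(X_major, X_minor):
    X_maj_sub = hull_peel_undersample(X_major, n_keep=len(X_minor))
    X = np.vstack([X_maj_sub, X_minor])
    y = np.concatenate([np.zeros(len(X_maj_sub)), np.ones(len(X_minor))])
    return SVC(kernel="rbf").fit(X, y)
```

In the thesis, several differently parameterized hull structures would each yield one such balanced training set, and the resulting learners would be combined by ensemble voting rather than used individually.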
(3) To further address the class-information loss of under-sampling, a new under-sampling method based on reverse k-nearest neighbors, RKNN, is proposed. Unlike the widely used k-nearest neighbors, reverse k-nearest neighbors examine a neighborhood from a global perspective: the reverse neighbors of a point depend not only on its local surroundings but also on the remaining points of the data set. A change in the data distribution changes the reverse nearest-neighbor relation of every sample, so the relation reflects the complete distribution structure of the set; exploiting the way reverse nearest neighbors propagate adjacency overcomes the purely local view of ordinary nearest-neighbor queries. For the majority class, the algorithm uses reverse k-nearest neighbors to remove noise, unstable boundary samples, and redundant samples, keeping the most class-informative and reliable samples for training. It balances the training set while mitigating the class-information loss caused by under-sampling. Experiments on the UCI Breast-cancer data set confirm its effectiveness on imbalanced learning, and it outperforms under-sampling based on ordinary k-nearest neighbors. A sketch of the reverse-kNN counting idea appears below.
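The following sketch illustrates only the reverse-kNN counting step; the abstract does not spell out the thesis's exact selection rule, so keeping the most-referenced majority samples is an assumption made here for illustration.

```python
# Minimal sketch: for every majority-class sample, count how many other majority
# samples list it among their k nearest neighbours (its reverse-kNN count), then
# keep the most-referenced samples until the classes are balanced. Samples that
# few points "point back to" are treated as noise or unstable boundary points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rknn_undersample(X_major, n_keep, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_major)   # +1: each point finds itself
    _, idx = nn.kneighbors(X_major)
    idx = idx[:, 1:]                                          # drop the self-neighbour column
    rknn_count = np.bincount(idx.ravel(), minlength=len(X_major))
    keep = np.argsort(rknn_count)[::-1][:n_keep]              # most-referenced samples first
    return X_major[keep]
```

The reduced majority set returned by `rknn_undersample` would then be combined with the minority samples to train the final classifier, as in the balanced training step of the previous sketch.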
【Degree-granting institution】: Zhejiang University
【Degree level】: Doctoral
【Year of degree conferral】: 2014
【CLC classification number】: R81-39
Article ID: 2246516