Application of a Consensus Model Based on SOM-Clustering Variable Selection to Near-Infrared Spectral Data

Published: 2018-06-24 00:04

Topics: quantitative analysis + consensus model; source: Wenzhou University, master's thesis, 2017

[Abstract]: Data modeling is a central topic in chemometrics and, depending on the task, divides into quantitative analysis and qualitative analysis. Single-model modeling is currently the common approach: a series of prediction models is built while the measured data are analyzed repeatedly, and the one with the best predictive performance is selected. However, the thousands of analytical channels of modern high-throughput instruments supply abundant measurement data for each sample, so the data often have few samples and many variables, and a single model struggles to meet the modeling requirements. To compensate for this shortcoming, multi-model consensus modeling has been widely studied and applied in many fields in recent years. Consensus modeling builds multiple member models with some modeling method and uses a consensus strategy to combine their predictions for unknown samples into one consensus result, with the aim of improving the prediction accuracy and reliability of the model. This thesis applies consensus modeling to near-infrared spectral data and examines both linear and nonlinear member models. The main contents are as follows.

The thesis first introduces the background and significance of the topic and reviews the basic principles of data modeling and the modeling methods used in the work.

It then studies consensus modeling with multiple regression member models built on variable selection, analyzes the advantages of variable selection, and proposes a consensus model based on partial least squares (C-SOM-PLS) and a consensus model based on least squares support vector machines (C-SOM-LS-SVM), that is, a linear and a nonlinear multi-member consensus model respectively. The procedure is as follows. The Kohonen self-organizing feature map (SOM) clustering algorithm is first used to select variables, grouping similar variables together and yielding N sub-datasets. The Duplex algorithm then divides the near-infrared spectral data of each of the N sub-datasets into a training set, a validation set and a test set. A series of member regression models is built on the training set, and the validation set is used to pick the model with the best predictive performance together with its error. The validation-set errors are used to compute the weights of the consensus model. Finally, the member models' predictions for unknown samples are combined by weighted summation into a consensus result. The results show that most of the consensus models predict better than the single models, improving both the prediction accuracy and the stability of the model.

Comparing the predictions of C-SOM-PLS, C-SOM-LS-SVM and their respective member models shows that some consensus models predict worse than their member models; the study attributes this to overfitting in the member models, which carries over into the consensus model. To reduce the effect of overfitting, the thesis introduces model population analysis (MPA) into the consensus model. The algorithm takes three steps: first, sub-datasets are obtained by Monte Carlo sampling; second, a sub-model is built for each sub-dataset; third, the parameters of all the sub-models are analyzed statistically over the sample space to extract useful information. The results show that introducing MPA effectively reduces the influence of overfitting on the consensus model.
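The weighted-summation consensus step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the weight formula, so inverse-squared validation RMSE is assumed here, and the function names (`rmse`, `consensus_weights`, `consensus_predict`) and all numbers are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def consensus_weights(val_errors):
    """Turn validation errors into weights that sum to 1.

    Assumption: inverse-squared-error weighting. The abstract only says
    that the weights are computed from the validation-set errors.
    """
    inv = [1.0 / (e ** 2) for e in val_errors]  # every error must be > 0
    total = sum(inv)
    return [v / total for v in inv]


def consensus_predict(member_preds, weights):
    """Weighted sum of member predictions, sample by sample.

    member_preds[k][i] is member k's prediction for sample i.
    """
    n_samples = len(member_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, member_preds))
        for i in range(n_samples)
    ]


# Hypothetical validation-set reference values and the predictions of
# three member models (one per variable cluster).
y_val = [1.0, 2.0, 3.0, 4.0]
val_preds = [
    [1.1, 2.1, 2.9, 4.1],
    [1.4, 1.6, 3.4, 3.6],
    [0.8, 2.2, 3.2, 3.8],
]
val_errors = [rmse(y_val, p) for p in val_preds]
weights = consensus_weights(val_errors)

# Hypothetical member predictions for two unknown samples.
test_preds = [[5.0, 6.0], [5.4, 6.4], [4.8, 5.8]]
y_consensus = consensus_predict(test_preds, weights)
print(weights, y_consensus)
```

With this weighting, the member with the smallest validation error dominates the consensus, and each consensus prediction stays within the range spanned by the member predictions because the weights are non-negative and sum to 1.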
[Degree-granting institution]: Wenzhou University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: O657.33
