Research on Virtual Sample Generation Technology and Its Modeling Applications

Published: 2018-04-23 21:46

Topics: small sample + virtual sample generation; Source: Beijing University of Chemical Technology, 2017 doctoral dissertation

【Abstract】: In the era of "big data", many fields are rich in data but poor in knowledge, and knowledge has to be discovered through data mining. Data-driven modeling has therefore become a research hotspot, yet its quality is severely limited when samples are too few, unrepresentative, or unevenly distributed. An important problem that cannot be ignored in the big-data setting is the "big data, small sample" problem. It arises mainly because data are costly to acquire, are repetitive, or correspond to low-probability events, so the useful data available are limited. How to build effective models from small samples is an important research direction in computational intelligence, with significant theoretical and practical value. Two main approaches are currently pursued: methods based on grey theory and machine learning, and methods that generate virtual samples. Generating new valid data from small-sample data is an effective way to supplement data, and virtual sample generation is an important research direction for the small-sample problem. Building on an extensive literature review, this dissertation addresses small-sample problems for labeled and unlabeled data, which correspond to supervised and unsupervised machine learning algorithms. It studies the generation, optimization, and application of virtual samples derived from small samples in order to produce sufficient valid data sets, then studies neural network structures and algorithms to propose new data-driven intelligent modeling methods, and finally applies the work to risk analysis of engineering construction costs. The main research contents are as follows:

(1) A new virtual sample generation method based on mega-trend diffusion. Mega-trend diffusion (MTD) is an effective distribution-based virtual sample generation technique, but existing versions use the same data distribution in the original sample region and in the diffusion region, and they add a virtual input attribute, which doubles the input space. This dissertation instead combines a non-uniform distribution in the known small-sample region with a uniform distribution in the extended region, and uses multi-distribution diffusion to estimate the acceptable range of each small-sample attribute. To avoid adding input attributes, the membership function value, which represents the possibility that a sample point occurs, is no longer computed as a virtual input attribute of the model. The result is a more effective mechanism for producing virtual samples: a novel multi-distribution mega-trend diffusion technique (MD-MTD). Its effectiveness is verified on benchmark functions and industrial data sets.

(2) A new virtual sample generation method based on optimization. To address the optimization of virtual samples, this dissertation builds on MD-MTD and proposes an information diffusion method based on a triangular membership function (TMIE), and from it a new way of determining the lower and upper bounds of the extended regions. Virtual samples are generated with the improved MD-MTD, and particle swarm optimization (PSO) is then applied to the generated input-attribute virtual samples to obtain more suitable ones. The resulting method is PSO-MD-MTD. Its effectiveness is verified on benchmark functions and industrial data sets.

(3) A new virtual sample generation method based on interpolation. Distribution-based virtual sample generation relies on a model built from the small sample, so this dissertation studies how to build a sound and effective neural network model from a small sample and then generates virtual samples according to the linear and nonlinear structural characteristics of that model. To this end, it proposes IVSG, a virtual sample generation method based on interpolation in the hidden layer of an extreme learning machine. Mid-value interpolation is applied to the hidden-layer output data to produce virtual samples, and the corresponding virtual output-layer outputs and input-layer inputs are then derived from these hidden-layer virtual samples in the forward and backward directions. Its effectiveness is verified on benchmark functions and industrial data sets, and IVSG, PSO-MD-MTD, and MD-MTD are compared to analyze where each method is applicable.

(4) A new functional link neural network modeling method based on partial least squares. Once the validity of the data samples has been addressed, mining the knowledge hidden behind the data through data-driven modeling becomes an important task. To handle collinear data in functional link neural networks and to extract the knowledge behind limited data effectively, this dissertation draws on the extreme learning machine model and replaces the error back-propagation algorithm of the original functional link neural network with a partial least squares learning algorithm for computing the model parameters. The resulting model is a functional link neural network based on partial least squares learning (PLSR-FLNN). Its effectiveness is verified on two industrial data sets, and a comparison with four other modeling methods demonstrates its advantages.

(5) Risk analysis and assessment of engineering construction costs using Monte Carlo sample expansion. Having addressed data and modeling problems in supervised learning, the dissertation turns to data problems in unsupervised learning. It focuses on the uncertain small-sample problem that arises when Monte Carlo methods are used for risk analysis of engineering construction costs, and proposes a sample supplementation method based on Monte Carlo simulation. On this basis, the probability distribution and probability density function of each cost item are estimated from the data samples, and Monte Carlo simulation and market-factor drivers are combined with Likert-scale analysis to analyze and evaluate the influencing factors comprehensively. This yields a practical risk analysis method for engineering construction costs, whose effectiveness is verified on a real engineering case.
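The range estimation and two-distribution sampling described in item (1) can be illustrated with a minimal sketch. The abstract does not give the MD-MTD equations, so the code below makes several assumptions: the bounds use the classic mega-trend diffusion formulas, the "non-uniform" distribution in the known region is taken to be triangular with its peak at the center of the observed range, and the names `mtd_bounds`, `generate_virtual_values`, and the mixing share `p_inside` are hypothetical. It should not be read as the dissertation's actual algorithm.

```python
import numpy as np


def mtd_bounds(x):
    """Estimate an acceptable range [lower, upper] for one attribute.

    Assumption: classic mega-trend diffusion (MTD) formulas are used as a
    stand-in, because the abstract does not state MD-MTD's own equations.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if lo == hi:  # degenerate sample: nothing to diffuse
        return lo, hi
    center = (lo + hi) / 2.0
    n_low = np.sum(x < center)  # samples left of the center (>= 1 here)
    n_up = np.sum(x > center)   # samples right of the center (>= 1 here)
    skew_low = n_low / (n_low + n_up)
    skew_up = n_up / (n_low + n_up)
    spread = -2.0 * x.var(ddof=1) * np.log(1e-20)
    lower = center - skew_low * np.sqrt(spread / n_low)
    upper = center + skew_up * np.sqrt(spread / n_up)
    # The estimated range must at least cover the observed data.
    return min(lower, lo), max(upper, hi)


def generate_virtual_values(x, n_virtual, p_inside=0.8, seed=None):
    """Draw virtual values: triangular inside [min, max], uniform outside.

    p_inside (share of draws from the known region) is a hypothetical knob.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if lo == hi:
        return np.full(n_virtual, lo)
    lower, upper = mtd_bounds(x)
    w_low, w_up = lo - lower, upper - hi  # widths of the extended regions
    if w_low + w_up == 0:  # no extension available: sample inside only
        p_inside = 1.0
    inside = rng.random(n_virtual) < p_inside
    n_in = int(inside.sum())
    out = np.empty(n_virtual)
    # Known region: non-uniform (triangular, peaked at the range center).
    out[inside] = rng.triangular(lo, (lo + hi) / 2.0, hi, size=n_in)
    # Extended regions: uniform, split in proportion to their widths.
    n_out = n_virtual - n_in
    if n_out > 0:
        in_low = rng.random(n_out) < w_low / (w_low + w_up)
        u = rng.random(n_out)
        out[~inside] = np.where(in_low, lower + u * w_low, hi + u * w_up)
    return out


samples = [1.0, 2.0, 4.0, 7.0]
lower, upper = mtd_bounds(samples)
virtual = generate_virtual_values(samples, n_virtual=200, seed=0)
```

The sketch handles a single input attribute and returns only attribute values. In a modeling workflow, each virtual input would still need an output label, for example from a model fitted on the original small sample. No membership value is appended as an extra input, in line with the abstract's point about not enlarging the input space.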

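【Degree-granting institution】: Beijing University of Chemical Technology
【Degree level】: Doctoral
【Year conferred】: 2017
【Classification number】: TP18; TP311.13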
