Research on Data Selection and Learning Algorithms for Big Data

Published: 2018-01-03 04:28

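Keywords: Research on Data Selection and Learning Algorithms for Big Data　Source: Xidian University (西安电子科技大学), 2015 doctoral dissertation　Document type: degree thesis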

[Abstract]: The era of information explosion has brought information of unprecedented variety and volume. With the rapid development and wide adoption of computer communication and Internet technology, and of the Internet of Things enabled by all kinds of sensors, collecting large amounts of data has become easy and cheap. This supplies the data needed for the rapid development of machine learning, pattern recognition, and computer vision, which the field of artificial intelligence urgently requires. However, how to select data effectively and how to learn useful information from data have become important problems for researchers. This thesis systematically studies data selection and the learning of the intrinsic subspace and manifold information of data, through model building, algorithm design, and analysis, and applies the resulting algorithms to engineering areas such as collaborative filtering, image inpainting, and video background modeling. The main results are as follows.

1. Manually labeling massive data is expensive in both labor and time, so active learning, a suitable way to minimize labeling cost, has attracted growing attention. Among existing active learning algorithms, some exploit the structure of the unlabeled data but need extra computation, such as hierarchical clustering, to select representative points; some must train multiple classifiers in advance at every iteration and pick the data to be labeled from an ensemble perspective; others consider only the data closest to the optimal decision boundary at each iteration. To overcome these shortcomings, we propose an active learning algorithm based on pairwise K-nearest-neighbor pseudo-editing. The method is inspired by K-nearest-neighbor editing as a preprocessing step; at each iteration it trains only one classifier and considers multiple data points near the optimal separating hyperplane. We also give the corresponding complexity analysis and parameter analysis. Extensive experiments show that, compared with other mainstream active learning algorithms, the proposed algorithm achieves good classification performance while querying and labeling only a small number of samples.

2. Low-rank matrix completion and recovery is a typical practical problem of learning the intrinsic structure and information of data from known entries. In recent years it has been solved well in the data-pool (batch) setting by trace-norm minimization of the matrix or by other variants of singular value decomposition. In that setting, the scale of the data, the sample size, the number of video frames, and so on are known in advance, so the problem can be solved by computing a singular value decomposition of the (sparse) data matrix at every iteration. The time complexity is very high, however, which makes such methods unsuitable for real-time use. To model the background of a video stream in real time, this thesis proposes an online gradient descent model on the Grassmannian manifold under a […]-norm framework. With this model, matrix completion and recovery can be solved online in a data-stream setting. By introducing Riemannian manifold optimization, the optimal subspace can be found along geodesics of the Grassmannian manifold. As an incremental learning method, it involves computation on only one data sample (vector) per iteration. The […]-norm framework is designed to approximately recover the original data from data contaminated by both sparse large noise (outliers) and Gaussian noise. An iterative algorithm based on the alternating direction method of multipliers and Grassmannian manifold optimization is proposed to solve robust low-rank matrix completion, robust low-rank matrix recovery, and background modeling for video surveillance in the online setting. In addition, a novel adaptive step-size strategy is proposed to track changes in the subspace effectively. Extensive experiments on synthetic and real data show that the method is more robust and effective than other mainstream algorithms.

3. Learning the intrinsic subspace information of known data can be generalized to learning the Riemannian quotient manifold structure behind a full-rank matrix factorization, where the low-rank constraint is expressed through the full-rank factorization. To solve more general matrix completion problems, including ill-conditioned and large-scale matrices, this thesis analyzes existing mainstream Riemannian manifold optimization algorithms from the viewpoint of the metric and, for the first time, constructs a novel Riemannian metric on the horizontal subspace of the tangent space of the Riemannian quotient manifold, based on the Riemannian geometric structure and the scale information of the objective function. The components needed for optimization on the quotient manifold are redesigned and recomputed. To validate the constructed metric, a nonlinear conjugate gradient method on the Riemannian quotient manifold is adopted. Extensive numerical experiments comparing convergence show that the proposed metric outperforms existing Riemannian metrics, and that the nonlinear conjugate gradient algorithm using it converges better than mainstream low-rank matrix completion algorithms.

4. Combining multiple individual classifiers to improve on the performance of a single classifier has become an increasingly active research topic in recent years. The question that follows is whether every one of the many generated individual classifiers actually helps reduce the generalization error of the ensemble. Balancing the diversity among individual classifiers against the accuracy of each one is both the starting point and the main difficulty in designing ensemble learning algorithms. This thesis therefore proposes a selective ensemble algorithm based on integer matrix decomposition. The algorithm addresses diversity and accuracy separately. To increase diversity, it takes the predicted labels of the individual classifiers as the original targets, introduces the true labels, and constructs an integer matrix that represents the individual classifiers; decomposing this matrix yields projection directions for the individual classifiers, from which new individuals are obtained. Then, to guarantee the performance of the transformed individuals, a standard performance criterion is used to remove poorly performing individuals from the ensemble. Finally, experiments on radar one-dimensional range profiles show that the algorithm effectively balances diversity among individuals and the accuracy of each individual, and that it improves radar target recognition accuracy compared with single classifiers and other ensemble methods.

5. In a supervised learning task, if training samples in the target domain are very scarce, the learning and generalization performance of the target-domain classifier inevitably suffers. Besides using active learning to select informative target-domain samples and label them to enlarge the training set, in some real settings other labeled samples already exist that are easier to obtain than target-domain training samples but follow a different data distribution; these labeled samples with a different distribution form the source domain. Transfer learning is therefore introduced to handle classification problems with scarce target-domain training samples. We propose two new transfer learning algorithms. The first is based on the rotation forest space transformation: it projects source-domain samples into the space formed by the target domain through the rotation forest transformation, then measures the similarity between the transformed source-domain samples and the target-domain samples to select usable source-domain samples that help train the target-domain classifier. Text classification experiments show that it achieves better classification performance than other algorithms. The second is a data-driven linear space mapping transfer ensemble algorithm. It projects source-domain samples into the space of target-domain samples that are easily misclassified, and thereby selects samples that help target-domain classification and adds them to the target domain, improving classification performance. In particular, to select source-domain samples more effectively, the source-domain samples are randomly partitioned, each subset is projected separately, and the results from all subsets are combined. Classification experiments on UCI data and synthetic aperture radar target images show that the algorithm effectively improves target-domain classification performance compared with other algorithms and reduces the instability of a single transfer.
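
The online subspace-tracking idea in contribution 2 (one vector per iteration, a gradient step along a Grassmannian geodesic) can be illustrated with a short sketch. It assumes a plain least-squares residual in the style of the well-known GROUSE update, not the thesis's robust norm framework, its multiplier-based solver, or its adaptive step size. The function name `grassmannian_step`, the constant step `eta`, and the tolerance `tol` are hypothetical choices made only for illustration.

```python
import numpy as np


def grassmannian_step(U, v, observed, eta=0.1, tol=1e-12):
    """One geodesic gradient step on the Grassmannian for a partially observed vector.

    U        : (n, d) matrix with orthonormal columns (current subspace estimate)
    v        : (n,) data vector; only the entries listed in `observed` are used
    observed : integer index array of the observed entries
    eta      : hypothetical constant step size (an assumption of this sketch)
    """
    U_obs = U[observed]
    # Least-squares weights that best explain the observed entries.
    w, *_ = np.linalg.lstsq(U_obs, v[observed], rcond=None)
    p = U @ w                               # prediction in the full space
    r = np.zeros_like(p)                    # residual, zero on unobserved entries
    r[observed] = v[observed] - U_obs @ w
    p_norm = np.linalg.norm(p)
    r_norm = np.linalg.norm(r)
    w_norm = np.linalg.norm(w)
    if r_norm < tol or p_norm < tol:
        return U                            # nothing to correct for this vector
    t = eta * r_norm * p_norm               # rotation angle along the geodesic
    direction = (np.cos(t) - 1.0) * p / p_norm + np.sin(t) * r / r_norm
    # Rank-one update; it rotates the subspace and keeps the columns orthonormal.
    return U + np.outer(direction, w / w_norm)
```

Each call touches a single vector and costs one small least-squares solve plus a rank-one update, which is what makes this kind of method usable on a data stream where a full singular value decomposition per iteration would be too slow.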

[Degree-granting institution]: Xidian University (西安电子科技大学)
[Degree level]: Doctoral
[Year degree granted]: 2015
[Classification number]: TP181
