Research on Feature Selection and Feature Learning Algorithms

Published: 2018-03-11 06:51

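Topic: feature selection  Focus: feature learning  Source: University of Science and Technology of China, 2017 master's thesis  Document type: degree thesis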

[Abstract]: With the arrival of the information age, the data used in machine learning have grown larger in scale and higher in dimension and carry complex noise, which poses challenges for model training and learning. Thoroughly analyzing and mining the data to extract their key features and latent information is therefore of significant research value. This thesis studies the problem from two angles: feature selection and feature learning. Feature selection aims to choose, by some strategy, an optimal subset of the original feature set. Existing feature selection algorithms mainly assess a feature's importance by its relevance to the target task: in supervised learning they consider the statistical correlation between a feature and the target, and in unsupervised learning they assess a feature's discriminative power on the sample set by how well it fits the structure of the samples. Going beyond relevance alone, this thesis proposes FSIR2, a feature selection algorithm that combines feature relevance and redundancy. The algorithm evaluates feature relevance on the basis of spectral feature selection theory while also accounting for redundancy within the feature set; it determines the optimal feature subset by maximizing the relevance between features and the target and minimizing the redundancy among features. The algorithm applies to both supervised and unsupervised settings.
Unlike feature selection, feature learning maps the original feature set into a new feature space in order to learn an optimal representation of the data. Existing feature learning algorithms fall into two groups: traditional learning algorithms and neural-network-based algorithms. At present, much work performs supervised feature learning with convolutional neural networks, recurrent neural networks, and the like, whereas research that makes full use of abundant, low-cost unlabeled data for feature learning remains scarce. This thesis proposes SoundAutoEncoder, a feature learning algorithm based on a convolutional autoencoder network. The algorithm performs unsupervised feature learning on the audio track of video data. On the one hand, it uses a convolutional autoencoder to exploit the useful information in the audio; on the other hand, it exploits the natural correspondence between the image and audio streams of a video, extracting semantic information from the images with a well-established visual recognition model and using it to guide feature learning on the audio.
For FSIR2, experiments are conducted on 10 datasets under both supervised and unsupervised conditions, measuring classification accuracy, clustering accuracy, and inter-feature redundancy on the selected feature subsets. Compared with MCFS, the best-performing of the compared algorithms, FSIR2 improves clustering accuracy and normalized mutual information (NMI) by 4%, reduces redundancy by 5%, and performs on par with MCFS in classification accuracy. For SoundAutoEncoder, scene classification experiments on three datasets test its feature learning ability. In classification accuracy, SoundAutoEncoder improves on SoundNet by 0.6%, 6.9%, and 6.3% on the DCASE-2016, ESC-10, and ESC-50 datasets, respectively.
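
The idea of maximizing relevance while minimizing redundancy can be illustrated with a small sketch. This is not the FSIR2 algorithm: the abstract does not give its spectral relevance score or its objective, so the code below assumes a simpler supervised stand-in in which relevance is the absolute Pearson correlation between a feature and the label, redundancy is the mean absolute correlation with the features already chosen, and selection is greedy. The function names `pearson` and `greedy_select` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def greedy_select(columns, target, k):
    """Greedily pick up to k feature indices by relevance minus redundancy.

    columns: list of feature columns, each a list of numbers
    target:  list of numeric labels, same length as each column
    Assumption: correlation-based scores stand in for the spectral
    relevance measure used in the thesis, which the abstract does not specify.
    """
    relevance = [abs(pearson(col, target)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        # The first pick uses relevance alone; later picks are penalized by
        # their mean absolute correlation with the already selected features.
        if not selected:
            return relevance[j]
        redundancy = sum(
            abs(pearson(columns[j], columns[s])) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: feature 1 is a scaled copy of feature 0, feature 2 is weaker
# but far less correlated with feature 0.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 2, 2, 2, 2, 2],
    [0, 1, 0, 0, 1, 0, 1, 1],
]
print(greedy_select(columns, target, 2))
```

On this toy data the second pick passes over the duplicated feature in favor of the less relevant but less redundant feature 2, which is the behavior the relevance-redundancy trade-off is meant to produce.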

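[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP18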

[Similar literature]