Research on Feature Selection Algorithms Based on Information Granulation

Published: 2018-04-20 03:35

Topic: feature selection + information granulation; Source: Minnan Normal University, master's thesis, 2016

[Abstract]: Feature selection is a key data preprocessing technique and an important research topic in data mining, pattern recognition, and machine learning. It is the process of removing the many irrelevant and redundant features from the original data and finding a feature subset that retains all or most of the classification information in the original feature space. For high-dimensional data, the idea of representing a whole by its parts suggests refining the data set from one large information granule into several smaller granules that together represent it effectively, which supports analysis of the data at multiple levels and from multiple perspectives. This thesis therefore applies the representation mechanism of information granulation to feature selection and constructs a series of feature selection models based on it. The thesis first reviews the state of research on feature selection, with emphasis on neighborhood granulation, large-margin, and local subspace models. Then, to eliminate redundant and irrelevant features, it takes granulation as the basis and studies different classification and prediction problems from three angles: sample granulation, feature granulation, and joint granulation of samples and features. The main contributions are:
(1) From the perspective of sample granulation, and taking into account that features themselves differ in quality, a feature selection algorithm based on feature quality is proposed. The algorithm defines feature quality using information entropy and the nearest neighbor using the large margin, and uses this nearest neighbor to granulate the samples. Experiments evaluate the model on three aspects: compactness of the feature subset, classification accuracy, and how classification accuracy changes with the number of features. The results show that an effective feature subset can be selected on the basis of feature quality.
(2) From the perspective of sample granulation, and using the neighborhood relation, a feature selection algorithm based on maximum nearest neighbor rough approximation (MNNRS) is proposed. The algorithm takes the neighborhood rough set feature selection algorithm NRS as its framework, uses the large margin to define a maximum nearest neighbor for granulating samples, and revises the computation of the positive region. MNNRS retains the advantages of NRS while effectively reducing computational complexity and improving classification performance.
(3) From the perspective of feature granulation, and addressing the high dimensionality of multi-label data sets and the label-specific relationships between labels and features, a multi-label feature selection algorithm based on local subspaces is proposed. Building on the local subspace model and information entropy theory, the algorithm identifies features that are relatively minor for the label set but must not be omitted. Experiments show that it effectively reduces computational complexity, improves classification performance, and makes the selection strategy more flexible.
(4) From the perspectives of both sample granulation and feature granulation, and addressing the high dimensionality and tendency to overfit of high-dimensional small-sample data, a heuristic local random feature selection method is proposed. The algorithm uses the local subspace model to granulate features and combines it with neighborhood granulation of samples to improve classification accuracy, reduce computational cost, and alleviate overfitting to some extent.
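The abstract names the neighborhood rough set (NRS) framework but gives no formulas, so the following is only a minimal sketch of the commonly described NRS idea, not the thesis's MNNRS algorithm. It assumes a sample belongs to the positive region when every sample within Euclidean distance `delta` of it (measured on the chosen features) has the same label, and it adds features greedily while the fraction of such samples grows. The function names, the `delta` value, and the toy data are all illustrative assumptions.

```python
from math import sqrt


def neighborhood_dependency(X, y, features, delta):
    """Fraction of samples whose delta-neighborhood on `features` is label-pure."""
    if not features:
        return 0.0
    n = len(X)
    consistent = 0
    for i in range(n):
        pure = True
        for j in range(n):
            # Euclidean distance restricted to the candidate feature subset.
            dist = sqrt(sum((X[i][f] - X[j][f]) ** 2 for f in features))
            if dist <= delta and y[j] != y[i]:
                pure = False  # a neighbor with a different label: not in the positive region
                break
        if pure:
            consistent += 1
    return consistent / n


def greedy_select(X, y, delta=0.2, eps=1e-9):
    """Forward selection: add the feature that most increases the dependency."""
    remaining = list(range(len(X[0])))
    selected = []
    best = 0.0
    while remaining:
        scores = [
            (neighborhood_dependency(X, y, selected + [f], delta), f)
            for f in remaining
        ]
        score, feature = max(scores, key=lambda t: t[0])
        if score - best <= eps:
            break  # no remaining feature enlarges the positive region
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Toy data (assumed): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.5], [0.1, 0.9], [0.05, 0.1], [0.9, 0.5], [1.0, 0.85], [0.95, 0.15]]
y = [0, 0, 0, 1, 1, 1]

selected, gamma = greedy_select(X, y, delta=0.2)
print(selected, gamma)
```

On this toy data the selection keeps only feature 0, because it alone already puts every sample in the positive region and the noisy feature adds nothing. The pairwise distance loop is quadratic in the number of samples, which is the kind of cost the abstract says MNNRS aims to reduce.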
[Degree-granting institution]: Minnan Normal University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP18; TP311.13
