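Research on Data-Driven Feature Selection Methods Based on Redundancy-Complementariness Dispersion and the Feature Envelope Frontier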

Published: 2018-05-30 08:44

Topic: data-driven + feature selection; Reference: Huazhong University of Science and Technology, 2016 doctoral dissertation

[Abstract]: As society develops, data are becoming increasingly complex and high-dimensional. Feature selection algorithms, which are widely used in research on dimensionality reduction for big data, have become an important research direction for socio-economic and business decision-making in a big-data, data-driven setting. The choice of parameters in a feature selection method strongly affects both the quality of the selected features and the re-representation of the data. The joint mutual information between a feature set S = {F1, ..., Fk} and the class C can be expanded as a sum of interaction-information terms between the features and the class at different orders, so the joint mutual information can be written in this expanded form. From the perspective of (2012), determining the parameters amounts to choosing among feature selection methods, yet the classical methods rely on parameters chosen a priori, for example the redundancy weight β in MIFS. How to find suitable, non-a-priori weights by compensating for the missing high-order interaction terms is therefore a major problem in feature selection.

Two frameworks for solving the parameter problem are given. The first takes a data-driven view and treats the parameters as corrections for the bias caused by omitting high-order interaction information. Building on a data-driven, mutual-information-based feature evaluation framework, it analyzes in depth the redundancy-complementariness dispersion caused by the missing high-order information, and introduces a correction factor driven by high-order information to correct the low-order redundancy-complementariness terms (that is, to determine the parameters), so that features can be evaluated and ranked accurately. The second framework addresses the multi-criteria nature of feature evaluation, the diversity of criterion weights, and their bias across domains and time periods by constructing a feature selection framework based on DEA. This framework exploits the data-driven nature of DEA, so that feature evaluation and selection take into account the diversity of relationships among features and the diversity of evaluation criteria, and can also cope with changes across data environments.

Under the first framework, the feature selection parameters are determined from the redundancy-complementariness dispersion caused by omitting high-order interaction information. Viewing the missing high-order mutual information through its "projection" onto the low-order terms, the dispersion this projection causes among features at low order is assessed and used to determine the low-order parameters. On this basis, a data-driven feature selection method based on redundancy-complementariness dispersion is proposed (Redundancy-Complementariness Dispersion-based Feature Selection method, RCDFS). Because existing statistical methods estimate high-order terms with unpredictable errors, the algorithm assigns, in a data-driven way, a coefficient (weight) to the second-order approximation of the redundancy-complementariness relation among features, compensating appropriately for the bias caused by the missing high-order terms. It is proved that feature evaluation criteria that use averaging are guaranteed to attain a lower bound on high-order redundancy and complementarity, which lays a foundation for effective data-driven integration of feature evaluation criteria.

The "prior knowledge" about evaluation criteria and feature-association bias in a given setting is contained in that setting's own data. Following the second framework, a DEA-based super-efficiency feature evaluation model is therefore constructed for feature selection. The model can be applied to the specific data of different domains: super-efficiency DEA selects suitable parameters for the evaluation criteria and constructs the corresponding super-efficiency envelope frontier, from which features are evaluated and ranked. A corresponding solution algorithm, MCSD, is given and its complexity is discussed. Experimental results show that the classification results obtained with the proposed MCSD algorithm are, in the great majority of cases, significantly better than those of IG, ReliefF, CMIM and JMI.

The rapid growth of road transport has brought a continuing increase in traffic accidents. Poor driving behavior is a cause of some major accidents, so identifying and analyzing abnormal driving behavior from dynamic monitoring data, especially behavior that requires close monitoring, is of great significance. According to the theory of Wright et al. (2009) and Mo et al. (2014), any new vehicle trajectory can be approximated by a linear combination of training trajectories, so sparse reconstruction can be applied to trajectory recognition and behavior classification. A large number of redundant trajectory features seriously degrades the accuracy of the trajectory learning model, and the slow solving speed of sparse-reconstruction-based trajectory learning models further underlines the importance of feature selection in modeling and processing. Feature selection is therefore embedded in a trajectory recognition model based on l2-lp sparse reconstruction, implemented with the data-driven feature selection algorithm proposed above. A method for solving the sparse reconstruction coefficient vector under the lp (0 < p < 1) norm is proposed, Orthogonal Matching Pursuit-quasi-Newton (OMPN). The method first uses the greedy Orthogonal Matching Pursuit (OMP) algorithm to find an initial feasible solution and then uses a quasi-Newton method to search further for a sparse solution. Finally, the relationship that holds under certain conditions between a local optimum of the lp (0 < p < 1) norm sparse problem and its exact solution is used to obtain a sparser solution. Experimental results demonstrate the effectiveness of the proposed framework and methods. They also show that results with embedded feature selection are better than those without it, indicating that the proposed data-driven feature selection methods have theoretical significance and broad application prospects in traffic safety management.
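To make the a-priori parameter problem concrete, the following is a minimal sketch of a MIFS-style greedy criterion, score(f) = I(f; C) - beta * sum over selected s of I(f; s), where beta is the redundancy weight mentioned in the abstract. It is not the thesis's RCDFS or MCSD method. The function names, the toy data, and the beta values are assumptions chosen for illustration, and mutual information is estimated from simple empirical counts on discrete features.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete 1-D arrays."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    # Map arbitrary labels to 0..n-1 so they can index a contingency table.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)      # count co-occurrences
    joint /= joint.sum()                 # joint distribution p(x, y)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                       # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mifs_select(X, y, k, beta=0.5):
    """Greedy MIFS-style selection; beta is the a-priori redundancy weight."""
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], y) for j in range(n_features)]
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        def score(j):
            redundancy = sum(
                mutual_information(X[:, j], X[:, s]) for s in selected
            )
            return relevance[j] - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: feature 1 duplicates feature 0, feature 2 is
# weaker but nearly non-redundant with feature 0.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
f0 = np.array([0, 0, 0, 1, 1, 1, 1, 0])
f2 = np.array([0, 0, 1, 0, 1, 0, 1, 0])
X = np.column_stack([f0, f0.copy(), f2])

print(mifs_select(X, y, k=2, beta=0.0))  # [0, 1]: redundancy ignored
print(mifs_select(X, y, k=2, beta=1.0))  # [0, 2]: duplicate penalized
```

The selected subset changes with the hand-picked beta, which is the sensitivity that motivates determining such weights from the data rather than fixing them in advance.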
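[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP311.13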

[Similar literature]