Two Classes of Adaptive Sparse Learning Machines and Their Applications in High-Dimensional Data Mining

Published: 2018-06-02 03:18

Topic: high-dimensional data mining + group lasso; Source: Henan Normal University, master's thesis, 2017

[Abstract]: With the continuing accumulation of modern high-dimensional data, traditional statistical learning methods, typified by the support vector machine, cannot perform high-dimensional variable selection well. Developing new adaptive sparse learning machines offers a new approach to high-dimensional data mining. To this end, this thesis combines methods from statistics, systems biology, and information theory to develop two biologically interpretable adaptive sparse learning models together with their solution algorithms. Applied to high-dimensional data analysis, both achieve good classification and gene selection performance. The main contributions are as follows.
(1) Group lasso type penalized methods face several difficulties on binary-classification high-dimensional data: variables must be grouped in advance, variable selection within groups should be adaptive, and the results should be biologically interpretable. We therefore study a network-analysis-based variable grouping strategy and a new adaptive penalty mechanism, and on that basis propose an adaptive sparse group lasso that integrates network analysis with information-theoretic methods. First, module identification in network analysis is linked to variable grouping in the group lasso, and weighted gene co-expression network analysis is used to identify modules with good biological interactions. Second, information-theoretic measures such as conditional mutual information are used to build a criterion for variable importance within each group; from this criterion, biologically interpretable weight coefficients are constructed and placed at the appropriate positions in the penalty term so that variables are selected adaptively. Finally, results on four high-dimensional cancer datasets confirm that the proposed adaptive sparse learning machine performs classification and grouped gene selection effectively.
(2) Group-penalized multinomial regression faces the problems of adaptive variable selection and biological interpretability on multiclass high-dimensional data. We propose a sparse multinomial regression that integrates network analysis. By combining biological resources with gene expression profile information, we use GeneRank to construct biologically meaningful weights and introduce them into the group lasso penalty, yielding a new adaptive sparse learning machine. Experimental results on the yeast diauxic shift (二次转化) data show that the proposed model achieves better classification and gene selection performance than the other models compared.
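The abstract does not give the exact form of the weighted penalty, so the following is only a minimal sketch of what an adaptive sparse group lasso penalty and its proximal step could look like. It assumes the common form lam * (alpha * sum_j w_j |beta_j| + (1 - alpha) * sum_g v_g * sqrt(p_g) * ||beta_g||_2). The names `w`, `v`, `lam`, `alpha`, and `step` are illustrative assumptions; in the thesis the within-group weights come from conditional mutual information and the groups from co-expression network modules, whereas here they are plain placeholder arrays.

```python
import numpy as np

def adaptive_sgl_penalty(beta, groups, w, v, lam, alpha):
    """Assumed penalty: weighted L1 term plus weighted group L2 term.

    groups: integer group id (0..G-1) for every coefficient
    w: per-coefficient weights (placeholder for information-based weights)
    v: per-group weights, indexed by group id
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    l1_term = np.sum(np.asarray(w) * np.abs(beta))
    l2_term = 0.0
    for g in np.unique(groups):
        idx = groups == g
        l2_term += v[g] * np.sqrt(idx.sum()) * np.linalg.norm(beta[idx])
    return lam * (alpha * l1_term + (1.0 - alpha) * l2_term)

def prox_adaptive_sgl(z, groups, w, v, lam, alpha, step=1.0):
    """Proximal operator of step * penalty, evaluated at z.

    Computed as weighted elementwise soft-thresholding followed by
    groupwise shrinkage of each group's remaining vector.
    """
    z = np.asarray(z, dtype=float)
    groups = np.asarray(groups)
    # Step 1: soft-threshold each coefficient by its own weighted threshold.
    thr = step * lam * alpha * np.asarray(w, dtype=float)
    u = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)
    # Step 2: shrink each group toward zero; small groups vanish entirely.
    out = np.zeros_like(u)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(u[idx])
        t = step * lam * (1.0 - alpha) * v[g] * np.sqrt(idx.sum())
        if norm > t:
            out[idx] = (1.0 - t / norm) * u[idx]
    return out
```

A larger weight `w[j]` raises the threshold for coefficient `j`, so variables judged less important are removed first, while a whole group is set to zero when its shrunken norm falls below the group threshold. A proximal gradient solver for a logistic loss would call `prox_adaptive_sgl` after each gradient step.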
[Degree-granting institution]: Henan Normal University
[Degree]: Master's
[Year conferred]: 2017
[Classification number]: TP18;TP311.13

[References]