Published: 2016-09-09 09:06

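Keywords for this article: Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms; compiled and published by 笔耕文化传播.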

Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms

XIE Juan-Ying, GAO Hong-Chao (School of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710062, China)

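Article abstract: To address the difficulty of selecting effective distinguishable gene subsets from high-dimensional, small-sample cancer gene datasets, novel hybrid gene selection algorithms based on statistical correlation and K-means are proposed. The algorithms first use the Pearson correlation coefficient and the Wilcoxon rank-sum test to compute the correlation between each gene and the class label and, following the statistical correlation principle, select the genes most strongly correlated with the class label to form a pre-selected gene subset. K-means is then used to group highly correlated genes in the pre-selected subset into the same cluster, an SVM classification model is trained, and a weight is computed for each gene. From each cluster, either the gene with the largest weight or, following the roulette-wheel idea, the gene with the most votes is chosen as the representative gene of that cluster; the representative genes of all clusters form the effective distinguishable gene subset. The algorithms are compared experimentally with Random, a gene selection algorithm that picks each cluster's representative gene at random; with SVM-RFE, Guyon's classic gene selection algorithm; and with SVM-SFS, a gene selection algorithm that uses a sequential forward search strategy. Average results over 200 repeated experiments on several classic gene datasets show that the proposed hybrid gene selection algorithms select gene subsets with very good discriminative ability, and that classifiers built on these subsets achieve very good classification performance.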
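The filter stage described in the abstract (scoring each gene against the class label, then keeping the top-scoring genes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `pearson`, `average_ranks`, `rank_sum_z`, and `preselect`, the `keep_fraction` parameter, the use of the normal-approximation z statistic without tie correction, and the toy data are all assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant gene carries no information about the label
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(xs, labels):
    """Wilcoxon rank-sum z statistic (normal approximation, no tie correction).

    Assumes binary labels 0/1 with at least one sample in each class.
    """
    ranks = average_ranks(xs)
    n1 = sum(1 for label in labels if label == 1)
    n0 = len(labels) - n1
    w = sum(r for r, label in zip(ranks, labels) if label == 1)
    mean = n1 * (n1 + n0 + 1) / 2
    sd = math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
    return (w - mean) / sd


def preselect(expr, labels, score, keep_fraction=0.1):
    """Indices of the genes with the largest |score|; expr[g] holds gene g's values."""
    scores = [abs(score(gene, labels)) for gene in expr]
    k = max(1, round(keep_fraction * len(expr)))
    return sorted(range(len(expr)), key=lambda g: scores[g], reverse=True)[:k]


# Toy data (hypothetical): 3 genes measured on 6 samples, 3 per class.
labels = [0, 0, 0, 1, 1, 1]
expr = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # separates the two classes
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # uninformative
    [5.0, 4.0, 6.0, 5.5, 4.5, 5.0],  # uninformative
]
selected_by_pearson = preselect(expr, labels, pearson, keep_fraction=0.34)
selected_by_rank_sum = preselect(expr, labels, rank_sum_z, keep_fraction=0.34)
```

Either scoring function can be passed to `preselect`, which mirrors the abstract's use of the two statistics as alternative filters. On real data a library routine with tie correction would likely be preferable to the hand-written statistic here.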

Abstract: To address the challenging problem of identifying the small number of distinguishable genes that separate cancer patients from healthy people in a dataset with few samples and tens of thousands of genes, this paper proposes novel hybrid gene selection algorithms based on statistical correlation and the K-means algorithm. The Pearson correlation coefficient and the Wilcoxon rank-sum test are each used to measure the importance of each gene to the classification, so that the least important genes are filtered out and about 10 percent of the important genes are preserved as the pre-selected gene subset. The related genes in the pre-selected gene subset are then clustered with the K-means algorithm, and the weight of each gene is calculated from the corresponding coefficient of the SVM classifier. The most important gene in each cluster, that is, the one with the largest weight or, when the roulette wheel strategy is used, the one with the most votes, is chosen as the representative gene of that cluster, and the representative genes together form the distinguishable gene subset. To verify the effectiveness of the proposed hybrid gene subset selection algorithms, a random selection strategy (named Random) is also used to select the representative genes from the clusters. The proposed algorithms are compared with Random, with the widely used gene selection algorithm SVM-RFE by Guyon, and with the previously studied gene selection algorithm SVM-SFS. The average results of 200 runs of these gene selection algorithms on several classic and widely used gene expression datasets demonstrate that the proposed distinguishable gene subset selection algorithms can find the optimal gene subset, and the cl
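The two ways of choosing a cluster's representative gene mentioned in the abstract (largest weight, or most votes under a roulette wheel strategy) might look roughly like the sketch below. The abstract does not say how votes are cast, so the voting scheme here is an assumption: each spin of the wheel picks one gene in the cluster with probability proportional to the absolute value of its weight, and the gene picked most often wins. The cluster assignments and weights are taken as given inputs, standing in for the output of K-means and a trained SVM; all names, the `spins` and `seed` parameters, and the toy values are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_cluster(cluster_of):
    """Map each cluster id to the list of gene indices assigned to it."""
    clusters = defaultdict(list)
    for gene, cluster in enumerate(cluster_of):
        clusters[cluster].append(gene)
    return clusters


def pick_by_max_weight(cluster_of, weights):
    """Per cluster, choose the gene whose weight has the largest magnitude."""
    return {
        cluster: max(genes, key=lambda g: abs(weights[g]))
        for cluster, genes in group_by_cluster(cluster_of).items()
    }


def pick_by_roulette(cluster_of, weights, spins=100, seed=0):
    """Per cluster, spin a weight-proportional wheel and keep the most-voted gene.

    Assumed voting scheme: the abstract names the strategy but not its details.
    """
    rng = random.Random(seed)
    representatives = {}
    for cluster, genes in group_by_cluster(cluster_of).items():
        wheel = [abs(weights[g]) for g in genes]
        if sum(wheel) == 0:
            wheel = [1.0] * len(genes)  # fall back to a uniform wheel
        votes = Counter(rng.choices(genes, weights=wheel, k=spins))
        representatives[cluster] = votes.most_common(1)[0][0]
    return representatives


# Toy inputs (hypothetical): 5 genes in 2 clusters, with made-up SVM weights.
cluster_of = [0, 0, 1, 1, 1]
weights = [0.2, -0.9, 0.05, 0.6, -0.1]

by_weight = pick_by_max_weight(cluster_of, weights)
by_votes = pick_by_roulette(cluster_of, weights)
```

With strongly dominant weights the two strategies will usually agree; the roulette variant differs mainly when several genes in a cluster have similar weights, where it introduces randomness into the choice.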

Keywords: distinguishable gene subset selection; Pearson correlation coefficient; Wilcoxon rank-sum test; K-means clustering; statistical correlation; Filter algorithms; Wrapper algorithms

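Funding: National Natural Science Foundation of China (31372250); Fundamental Research Funds for the Central Universities (GK201102007); Shaanxi Province Science and Technology Research Project (2013K12-03-24)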

Article ID: 112108


Article link: https://www.wllwen.com/kejilunwen/jiyingongcheng/112108.html
