Research on Gene Selection Methods Based on a Scoring Criterion and Particle Swarm Optimization

Published: 2018-03-16 21:36

Topic: gene expression profile data　Focus: gene selection　Source: Jiangsu University, 2017 master's thesis　Document type: degree thesis

[Abstract]: Cancer is a leading killer in modern society. It takes many forms and calls for different treatments, so early diagnosis followed by targeted treatment is the key to saving lives. Gene chips have opened a new route to understanding disease mechanisms at the molecular level, and mining gene expression profile data to identify disease-causing genes is of great significance for cancer diagnosis and treatment. Many gene selection methods can select gene subsets with high classification performance, but they are computationally expensive, and the genes they select are poorly interpretable and highly redundant. To overcome these shortcomings, this thesis proposes an effective scoring mechanism and then uses particle swarm optimization (PSO) and the extreme learning machine (ELM) to select gene sets with high classification performance and good interpretability. The main contributions are as follows.

(1) To address the high time cost of traditional gene selection methods and the poor interpretability of the gene subsets they select, a gene selection method based on a scoring criterion and an improved PSO algorithm is proposed. First, the original gene pool is preprocessed using the classification information index. Drawing on the statistical soundness of sample surveys, a matrix of gene sets, each containing a limited number of genes, is generated at random; the gene sets are evaluated with an ELM, and those that meet the condition are retained. The scoring criterion is then used to evaluate and rank the genes and to screen out the relevant ones. Finally, the PSO algorithm is improved with simulated annealing and applied to further select among the genes that passed the scoring criterion. The method has simple steps and a low time cost. Experiments on several public gene expression profile datasets show that, compared with other methods, it accurately removes a large amount of redundancy and can therefore quickly and efficiently select gene subsets that are highly relevant to tumor class.

(2) The scoring mechanism does not fully exploit the information that directly relates each gene to the classification, and the PSO algorithm is still prone to becoming trapped in local optima. To address both defects, an improved method based on gene information weighting and particle semi-initialization is proposed. First, the number of runs used to compute the average fitness value is adjusted according to the size of the variance. Next, the classification weight information carried by each gene is added to the scoring criterion as a new evaluation measure, which refines the scoring mechanism. Finally, to counter the tendency of PSO to become trapped in local optima, an update threshold is set that forces half of the particles to be updated within the search range. The improved method makes full use of the information contained in the genes themselves, which makes the scoring mechanism more reasonable, and it escapes local optima faster than other methods. Experiments on four datasets show that the method based on information weighting and PSO further improves the classification accuracy of the selected gene subsets.
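The abstract does not spell out how simulated annealing is combined with PSO. As a rough illustration only, the sketch below shows one common hybrid: a binary PSO over gene masks in which each particle's personal best is updated with a Metropolis acceptance rule, so a slightly worse subset can occasionally replace it and help the swarm leave a local optimum. Everything here is an assumption rather than the thesis's procedure: the function names `fitness` and `sa_binary_pso`, all parameter values, and in particular the toy fitness function, which stands in for the ELM classification accuracy used in the thesis.

```python
import math
import random


def fitness(mask, informative, penalty=0.01):
    """Hypothetical stand-in for classifier accuracy (the thesis uses an ELM).

    Rewards selecting genes from a known 'informative' set and penalizes
    subset size, so small relevant subsets score highest.
    """
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in informative)
    return hits / len(informative) - penalty * sum(mask)


def sa_binary_pso(n_genes, informative, n_particles=20, n_iter=100,
                  w=0.7, c1=1.5, c2=1.5, t0=1.0, cooling=0.95, seed=0):
    """Binary PSO with a simulated-annealing rule for personal-best updates."""
    rng = random.Random(seed)
    # Each position is a 0/1 mask over genes; velocities start at zero.
    pos = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_genes for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, informative) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    temp = t0

    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(n_genes):
                # Standard velocity update, clamped to keep the sigmoid stable.
                v = (w * vel[k][d]
                     + c1 * rng.random() * (pbest[k][d] - pos[k][d])
                     + c2 * rng.random() * (gbest[d] - pos[k][d]))
                vel[k][d] = max(-6.0, min(6.0, v))
                # Sigmoid transfer: velocity becomes the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[k][d]))
                pos[k][d] = 1 if rng.random() < prob else 0

            fit = fitness(pos[k], informative)
            delta = fit - pbest_fit[k]
            # Metropolis rule: always accept improvements; accept a worse
            # personal best with probability exp(delta / temp).
            if delta > 0 or rng.random() < math.exp(delta / temp):
                pbest[k], pbest_fit[k] = pos[k][:], fit
            # The global best only ever improves, so the best subset is kept.
            if fit > gbest_fit:
                gbest, gbest_fit = pos[k][:], fit
        # Geometric cooling, floored to avoid division by zero.
        temp = max(temp * cooling, 1e-12)

    return gbest, gbest_fit


if __name__ == "__main__":
    informative = {2, 5, 11, 17, 23}  # made-up indices of relevant genes
    best_mask, best_fit = sa_binary_pso(30, informative)
    print([i for i, b in enumerate(best_mask) if b], round(best_fit, 3))
```

In a real pipeline the `fitness` function would be replaced by the cross-validated accuracy of a classifier trained on the selected genes, and the candidate genes would be those that survived the scoring step rather than the full gene pool.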
[Degree-granting institution]: Jiangsu University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: R73-3;TP18

[References]