Research and Application of Clustering Analysis Algorithms for Gene Expression Data
Published: 2018-10-18 19:26
[Abstract]: The widespread use of gene chip technology has produced massive amounts of gene expression data. Analyzing and processing these data to extract useful biological or medical information is both the key step and the main difficulty in applying gene chip technology, and it has become one of the research hotspots of the post-genome era. Cluster analysis groups functionally related genes into co-expression classes according to the similarity of their expression profiles. This supports the integrated study of gene function, gene regulation, cellular processes and cell subtypes, and makes clustering one of the main techniques for analyzing gene expression data. This thesis addresses three specific problems in the cluster analysis of gene expression data: the choice of clustering algorithm and parameters, the validation of clustering results, and the estimation of the number of clusters. The main work and contributions are as follows:

1. Using gene expression datasets with external criteria for the first time, the thesis studies how the most commonly used gene clustering algorithms (hierarchical clustering, K-means clustering and SOMs) depend on the choice of similarity measure and data transformation, and compares their performance. The results show that hierarchical clustering works best with the Pearson correlation coefficient as the similarity measure and with row-standardized data, whereas K-means clustering and SOMs work best with the Euclidean distance criterion and standardized log-transformed data. Single-linkage hierarchical clustering should be avoided wherever possible, and K-means clustering and SOMs perform significantly better than hierarchical clustering.

2. The thesis studies the ability of the Silhouette index, Dunn's index, the Davies-Bouldin index and the FOM (figure of merit) measure to validate gene clustering results. The results show that the Silhouette index and the FOM measure reflect both the performance of a clustering algorithm and the quality of its results well. Dunn's index is highly sensitive to noise and therefore cannot be used directly to validate gene clustering results. The Davies-Bouldin index validates better than Dunn's index but is biased toward single-linkage clustering.

3. The thesis studies the ability of the Silhouette index, the Davies-Bouldin index and the FOM measure to estimate the number of clusters. The results show that the Silhouette index and the Davies-Bouldin index both identify the exact number of clusters with low accuracy, which makes them hard to use in practice. The position of the elbow in the FOM curve gives only a rough estimate of the number of clusters and involves uncertainty and subjectivity. New relative Silhouette and relative Davies-Bouldin indices are defined to extend the ability of the existing indices to estimate the number of clusters. Prediction strength, a function designed specifically for estimating the number of clusters, is introduced into gene cluster analysis and improves the reliability of the estimate.

4. Cluster boundaries are hard to determine in the projection produced by a high-resolution SOM. To address this, K-means is used to cluster the units of the trained SOM network, which combines the SOM algorithm with K-means clustering. This combined method is applied in a systematic analysis of whole-genome expression data from the yeast diauxic shift (二次迁移). It yields gene classes with highly similar expression profiles, which provide important clues for predicting the function of unknown genes.
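
As a rough illustration of the Silhouette index discussed in point 2, the Python sketch below computes the mean silhouette width of a given partition of row-standardized expression profiles under Euclidean distance. It is not the thesis's implementation. The function names (`standardize_row`, `silhouette_index`) and the four toy profiles are assumptions invented for this example, and only the Python standard library is used.

```python
import math
from statistics import mean, pstdev


def standardize_row(profile):
    """Scale one gene's expression profile to mean 0 and unit variance."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        return [0.0] * len(profile)  # flat profile carries no shape information
    return [(x - mu) / sigma for x in profile]


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_index(rows, labels, dist=euclidean):
    """Mean silhouette width over all genes; values lie in [-1, 1]."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: width 0 by convention
            continue
        # a: mean distance to the other members of the gene's own cluster
        a = mean(dist(rows[i], rows[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(rows[i], rows[j]) for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)


# Hypothetical toy data: two rising and two falling profiles on different scales.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [4.0, 3.0, 2.0, 1.0],
    [40.0, 30.0, 20.0, 10.0],
]
z = [standardize_row(g) for g in genes]

good = silhouette_index(z, [0, 0, 1, 1])  # groups profiles by shape
bad = silhouette_index(z, [0, 1, 0, 1])   # mixes rising with falling
print(good, bad)
```

On this toy data the partition that groups profiles by shape scores close to 1, while the mixed partition scores close to -0.5. For row-standardized profiles (population standard deviation, n conditions), the squared Euclidean distance equals 2n(1 − r), where r is the Pearson correlation coefficient, so the two criteria order pairs of genes identically.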
[Degree-granting institution]: Tianjin University
[Degree level]: Doctoral
[Year conferred]: 2006
[Classification number]: R311
Article ID: 2280135
Article link: https://www.wllwen.com/yixuelunwen/binglixuelunwen/2280135.html