
Feature Selection and Sample Selection for Cancer Classification and Drug Structure-Activity Relationship Studies

Published: 2018-04-21 17:24

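  Topics: high-dimensional feature selection + nearest-neighbor sample selection; Source: Hunan Agricultural University, doctoral dissertation, 2014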


[Abstract]: In large-scale data modeling, feature selection and sample selection can substantially improve a model's predictive performance and reduce modeling time; they are necessary steps and effective tools for building classification or regression models. This thesis optimizes models from several angles (feature acquisition and screening, choice of learning machine, and sample selection) and applies them to the analysis of cancer gene-chip data (classification) and to Quantitative Structure-Activity Relationship (QSAR) studies of drugs (regression).

First, to overcome the shortcomings of the traditional F-test and the top scoring pair family of algorithms, which make only one-way comparisons and ignore interactions, the thesis uses two-way analysis of variance with unequal replication to compare multiple genes in both directions, taking the interaction between multiple genes and the phenotype into account as a whole. Informative genes are obtained through comprehensive weighted ranking and redundancy removal; combined with transductive inference, a direct classifier that requires no training is constructed. Informative-gene selection and independent prediction on 10 multi-class tumor expression datasets, compared from several angles, show that: 1) the new method achieves an average prediction accuracy (92.06%) better than that of the reference models while using fewer informative genes; 2) it outperforms the top-scoring series and correlation-based gene selection algorithms; 3) it is comparable to support vector classification and better than linear logistic regression and naive Bayes. For the leukemia and breast cancer datasets, multiple rounds of gene selection followed by Gene Ontology analysis of biological pathways identified several important biological pathways and disease-causing genes.

Second, because analysis of variance is not suited to feature selection on regression data, the Binary Matrix Shuffling Filter (BMSF) is applied to QSAR studies of antitumor drugs on two cell lines, RPMI8402 and P388. A set of 2923 high-dimensional molecular descriptors was obtained with the quantum chemistry software PCLIENT, features were screened with BMSF, and Support Vector Regression (SVR) was used for modeling and prediction. The results show that SVR models built on the literature descriptors outperform multiple linear regression, stepwise linear regression, and partial least squares regression, and are comparable to artificial neural networks. For the high-dimensional descriptors, feature screening retained 11 features in each case; SVR models built on the retained descriptors outperform the other reference models, the nonlinear regression is highly significant, and the single-factor importance of most retained descriptors is significant. The analysis of descriptor effects on drug activity offers ideas for designing highly active antitumor drugs.

Further, considering feature screening and sample selection together, BMSF and the geostatistical semivariogram are applied to QSAR modeling of angiotensin-converting enzyme inhibitors and of peptides that bind human leukocyte antigen class I molecules. Peptide sequences are characterized by 531 physicochemical properties of amino acids, features are screened with BMSF, and a common range is determined by geostatistics. For each test sample, K nearest-neighbor samples lying within the common range are selected from the training set, and an individualized prediction is made with SVR. The results show that, for 1593 and 4779 high-dimensional descriptors, an average of 15.4 and 15.8 features respectively were retained after screening over 5 sample partitions, and the independent prediction accuracy Q2pred was 0.982 and 0.806 respectively, better than both the literature reference models and one-way selection models. The residue distribution and preferences of several descriptor subsets were analyzed, providing theoretical guidance for designing highly active peptides. The methods in this thesis have broad application prospects in biomarker screening, pattern classification, molecular activity prediction, and related fields.
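The per-sample neighbor selection and the Q2pred score described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation and rests on several assumptions: the common range is taken as a given number (the thesis derives it from a semivariogram, which is not reproduced here), distance is Euclidean, the local model is a plain mean of the selected neighbors standing in for SVR, and Q2pred uses one common external-validation form (one minus the prediction error sum of squares divided by the sum of squares around the training mean), since the abstract does not state the exact formula. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_neighbors(query, train_X, common_range, k):
    """Indices of at most k training samples closer to `query` than
    `common_range`, nearest first (ties broken by index)."""
    dists = sorted((euclidean(query, x), i) for i, x in enumerate(train_X))
    return [i for d, i in dists if d < common_range][:k]


def local_predict(query, train_X, train_y, common_range, k):
    """Stand-in local model: mean activity of the selected neighbors.
    Assumption: the thesis fits an SVR on these neighbors instead."""
    idx = select_neighbors(query, train_X, common_range, k)
    if not idx:
        # No training sample inside the range: fall back to the global mean.
        return mean(train_y)
    return mean(train_y[i] for i in idx)


def q2_pred(y_true, y_pred, y_train_mean):
    """External Q2 in one common form: 1 - PRESS / SS around the training mean."""
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss = sum((t - y_train_mean) ** 2 for t in y_true)
    return 1.0 - press / ss


# Toy data invented for illustration only (not from the thesis).
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [1.0, 1.2, 0.8, 4.0, 4.4]
test_X = [[0.5, 0.5], [5.5, 5.0]]
test_y = [1.1, 4.0]

preds = [local_predict(x, train_X, train_y, common_range=2.0, k=3) for x in test_X]
print(preds, q2_pred(test_y, preds, mean(train_y)))
```

Each test sample gets its own small training subset, so a sample with no close neighbors needs an explicit fallback; the sketch uses the global mean, which is a design choice of this example rather than something stated in the abstract.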
[Degree-granting institution]: Hunan Agricultural University
[Degree level]: Doctoral
[Year conferred]: 2014
[Classification number]: Q811.4; R96





Article link: https://www.wllwen.com/yixuelunwen/swyx/1783372.html

