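Machine Learning-Based Prediction of Cross-Species Transmission and Antigenic Relationships of Influenza A Viruses
Published: 2018-06-30 08:38
Topics: machine learning + support vector machine; Source: Huazhong University of Science and Technology, 2012 doctoral dissertation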
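[Abstract (translated from the Chinese original)]: Avian influenza viruses are influenza A viruses adapted to birds. Over the past decade or so, their cross-species transmission has caused heavy loss of life and property and has drawn wide public concern. H3N2 is another influenza A subtype with a major impact on human society: its antigenic variation renders vaccines ineffective and makes global influenza surveillance considerably harder. Studying the cross-species transmission and antigenic relationships of these two kinds of influenza A virus therefore has both theoretical and practical value. Using machine learning, information theory and feature selection, this work develops and improves prediction models for avian-to-human transmission of avian influenza viruses and for antigenic relationships among H3N2 viruses. It also identifies 90 signature amino acid positions for avian-to-human transmission and 18 key amino acid positions for H3N2 antigenic variation, which can support early warning for public health and suggest directions for research on the molecular determinants and underlying mechanisms.
First, no avian influenza virus has yet been experimentally verified as unable to pass from birds to humans, and a one-class SVM suits problems in which negative samples are hard to define. The work therefore explores whether a one-class SVM can predict avian-to-human transmission. Viral protein sequences are encoded by amino acid composition, dipeptide composition and autocorrelation coefficients, and the resulting one-class SVM model predicts more accurately than the existing back-propagation neural network model.
Second, while building negative samples for testing in the first part, the author found that these negatives were more reliable than those used in existing prediction models. The data for both classes were therefore enlarged, and a conventional two-class approach was adopted to improve the prediction of avian-to-human transmission and to mine biologically meaningful features. Ninety signature amino acid positions were first selected by information entropy. After these positions were encoded by physicochemical properties, several feature selection methods (Relief, mRMR, information gain and a genetic algorithm) were used to choose an optimal feature subset. The model built on this subset performs substantially better, and the physicochemical properties finally selected differ clearly between the two classes, which indicates that the features are effective. Two of these properties are also supported by several biological studies.
Third, H3N2 antigenic variation data recorded in the literature were collected manually, nearly doubling the amount of data used in the three most recent H3N2 antigenic variation studies. Several scoring strategies were then compared, including the odds ratio, mutual information and the Phi correlation coefficient, and combined with multiple linear regression to identify 18 key positions for H3N2 antigenic variation. All 18 positions lie in the five antigenic epitopes of the HA protein, and 8 of them coincide with previously identified positively selected positions, which indicates that they play an important role in H3N2 antigenic variation.
Finally, building on the third part, the work aims to improve the H3N2 antigenic relationship prediction model and reduce false positives. Some amino acid mutations may not cause antigenic variation, which may arise only when physicochemical properties change. Guided by this observation, changes in multiple physicochemical properties are integrated into the prediction. Candidate properties are screened by mutual information and hierarchical clustering. The experiments show that the resulting model performs considerably better than the model from the third part and outperforms three other current H3N2 antigenic relationship models: the Hamming distance model, the grouped-scoring multiple linear regression model and the decision tree. A Web tool for predicting H3N2 antigenic relationships was also built to provide an online service to researchers.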
[Abstract]: Avian influenza viruses are bird-adapted influenza A viruses. Over the past decade or so, cross-species transmission of avian influenza viruses has caused great loss of life and property and has attracted wide public attention. The H3N2 subtype is another influenza A virus with an important impact on human society. Its antigenic variation renders vaccines ineffective and makes global influenza surveillance considerably more difficult. Studying the cross-species transmission and antigenic relationships of these two kinds of influenza A virus therefore has important theoretical and practical significance. Based on machine learning, information theory, feature selection and related methods, this work develops and improves prediction models for the avian-to-human transmission of avian influenza viruses and for the antigenic relationships of H3N2 influenza viruses. It also identifies 90 signature amino acid positions for avian-to-human transmission and 18 key amino acid positions for H3N2 antigenic variation. These results can provide early warning for public health and offer ideas for research on the related molecular determinants and underlying mechanisms.
First, no avian influenza virus has yet been experimentally verified as unable to transmit from birds to humans, and a one-class SVM is well suited to problems in which negative samples are hard to determine. The feasibility of using a one-class SVM to predict avian-to-human transmission is therefore explored. Avian influenza virus protein sequences are encoded by amino acid composition, dipeptide composition and autocorrelation coefficients, and a one-class SVM prediction model is built. Its prediction accuracy exceeds that of the existing back-propagation neural network model.
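The sequence encoding described above can be illustrated with a minimal Python sketch of the amino acid composition and dipeptide composition features. This is not the dissertation's code: the function names and the example fragment are hypothetical, the autocorrelation features are omitted, and non-standard symbols are simply dropped before counting, which is a simplification.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def clean(seq):
    """Keep only standard residues (assumption: gaps, X, etc. are dropped)."""
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)


def amino_acid_composition(seq):
    """Return 20 fractions, one per standard amino acid."""
    seq = clean(seq)
    counts = Counter(seq)
    total = len(seq) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 fractions, one per ordered pair of adjacent residues."""
    seq = clean(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def encode(seq):
    """Concatenate both encodings into one 420-dimensional feature vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


# Hypothetical protein fragment, used only to exercise the functions.
features = encode("MKTIIALSYIFCLVLG")
```

Vectors of this kind could then be passed to a one-class SVM implementation (for example, scikit-learn's `OneClassSVM`) trained on positive samples only.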
Second, while constructing negative samples for testing in the first part of the work, it was found that these negative samples were more reliable than those used in existing prediction models. The data for both classes were therefore enlarged, and a conventional two-class approach was adopted to improve the prediction of avian-to-human transmission while mining biologically meaningful features. First, 90 signature amino acid positions were selected by an information entropy method. After these positions were encoded by physicochemical properties, several feature selection methods, including Relief, mRMR, information gain and a genetic algorithm, were used to select the optimal feature subset. The prediction model built on this subset performs substantially better. The physicochemical properties finally selected differ clearly between the two classes of samples, which indicates that the features are effective, and two of these properties are supported by the results of several biological studies.
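As a rough illustration of entropy-based position scoring, the sketch below computes the information gain of each alignment column with respect to a binary class label. The abstract does not state the exact entropy criterion used in the dissertation, so this is only one plausible reading; the function names, the toy alignment and the labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def information_gain(column, labels):
    """H(labels) - H(labels | residue) for one alignment column."""
    n = len(column)
    conditional = 0.0
    for residue in set(column):
        subset = [y for x, y in zip(column, labels) if x == residue]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def rank_positions(alignment, labels):
    """Return (1-based position, gain) pairs, highest gain first."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    gains = [
        (i + 1, information_gain([seq[i] for seq in alignment], labels))
        for i in range(length)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Toy data: label 1 = transmitted to humans, 0 = not (hypothetical).
alignment = ["AKT", "AKS", "ARS", "ART"]
labels = [1, 1, 0, 0]
ranking = rank_positions(alignment, labels)
```

In this toy case only the second column separates the two classes, so it ranks first; keeping the top-scoring positions is one simple way to obtain a candidate set before further feature selection.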
Third, H3N2 influenza virus antigenic variation data recorded in the related literature were collected manually, nearly doubling the amount of data used in the three most recent H3N2 antigenic variation studies. Several scoring strategies were then compared, including the odds ratio, mutual information and the Phi correlation coefficient, and combined with multiple linear regression to identify 18 key positions for H3N2 antigenic variation. All 18 key positions are located in the five antigenic epitopes of the HA protein, and 8 of them coincide with previously identified positively selected positions. This indicates that the 18 key positions identified in this study play an important role in H3N2 antigenic variation.
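The three scoring strategies named above can each be computed from a 2x2 contingency table. The sketch below assumes, for illustration only, that each table counts strain pairs by whether a given position is mutated and whether the pair is antigenically distinct; the cell layout, the zero-cell correction and the example counts are assumptions, not details taken from the dissertation.

```python
import math

# Assumed cell layout for one amino acid position, counted over strain pairs:
#   a: mutated,     antigenically distinct
#   b: mutated,     antigenically similar
#   c: not mutated, antigenically distinct
#   d: not mutated, antigenically similar


def odds_ratio(a, b, c, d):
    """Odds ratio; adds 0.5 to every cell if any cell is zero (assumption)."""
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


def phi_coefficient(a, b, c, d):
    """Phi correlation coefficient of the 2x2 table (0.0 if undefined)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Mutual information (bits) between mutation and antigenic change."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    mi = 0.0
    for count, i, j in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        if count:
            mi += (count / n) * math.log2(count * n / (rows[i] * cols[j]))
    return mi


# Hypothetical counts for a single position.
table = (30, 10, 10, 50)
scores = {
    "odds_ratio": odds_ratio(*table),
    "phi": phi_coefficient(*table),
    "mutual_information": mutual_information(*table),
}
```

Positions whose scores are high under a chosen strategy would be the candidates carried forward, for instance as predictors in a multiple linear regression.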
Finally, building on the previous part, the work aims to improve the prediction model for H3N2 influenza virus antigenic relationships and reduce false positives. Some amino acid mutations may not cause antigenic variation, which may arise only when physicochemical properties change. Guided by this observation, changes in multiple physicochemical properties are integrated to improve the prediction of H3N2 antigenic relationships. Candidate physicochemical properties are screened by mutual information and hierarchical clustering. The final experimental results show that the resulting model performs considerably better than the model built in the previous part and outperforms three other current H3N2 antigenic relationship prediction models: the Hamming distance model, the grouped-scoring multiple linear regression model and the decision tree. In addition, a Web tool for predicting H3N2 influenza virus antigenic relationships was built to provide an online service for researchers.
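To illustrate why property changes can carry more information than a plain mutation count, the sketch below contrasts the Hamming distance with a summed change in one physicochemical property. The choice of the Kyte-Doolittle hydropathy scale, the function names and the four-residue example sequences are assumptions made for illustration; the abstract does not say which properties the dissertation selected.

```python
# Kyte-Doolittle hydropathy index, used here as one example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def check_aligned(seq_a, seq_b):
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")


def hamming_distance(seq_a, seq_b):
    """Number of positions at which the two sequences differ."""
    check_aligned(seq_a, seq_b)
    return sum(x != y for x, y in zip(seq_a, seq_b))


def property_change(seq_a, seq_b, scale=HYDROPATHY):
    """Sum of absolute property differences over the mutated positions."""
    check_aligned(seq_a, seq_b)
    return sum(abs(scale[x] - scale[y]) for x, y in zip(seq_a, seq_b) if x != y)


# Hypothetical fragments: each variant differs from the reference at one site.
reference = "KNDS"
conservative = "KNDT"   # S -> T, similar hydropathy
radical = "INDS"        # K -> I, very different hydropathy

distances = (hamming_distance(reference, conservative),
             hamming_distance(reference, radical))
changes = (property_change(reference, conservative),
           property_change(reference, radical))
```

Both variants are one mutation away from the reference, yet their hydropathy changes differ by almost two orders of magnitude, which is the kind of distinction a Hamming distance model cannot express.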
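[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Doctoral
[Year conferred]: 2012
[Classification number]: R373; TP181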
Article ID: 2085682
Article link: https://www.wllwen.com/xiyixuelunwen/2085682.html