Feature Analysis of Influenza Virus DNA Sequences Based on Time Series Theory and Methods

Published: 2018-05-11 21:41

Topic: influenza virus + DNA sequence; Reference: Jiangnan University, 2011 master's thesis

[Abstract]: Influenza is a recurrent infectious disease that causes high morbidity and mortality worldwide. Influenza viruses are classified into three types: A, B, and C. Of these, influenza A is the deadliest and causes severe disease in humans. The 2009 pandemic, together with the several influenza outbreaks of the 20th century, shows that our understanding of influenza viruses is still incomplete and that many of their properties remain to be explored. Because influenza viruses pose a serious threat to human health, further study of their DNA and protein sequences is urgent; characterizing these sequences is important for influenza prevention, new vaccine development, drug molecule design, control, and treatment.

After introducing the research background of bioinformatics, this thesis presents the main approach used to study the properties of biological sequences, namely time series theory, which analyzes, predicts, and controls by processing dynamic data. The definitions, properties, and methods of the ARIMA(p,d,q) and ARFIMA(p,d,q) models used in the thesis are described, laying the theoretical groundwork for studying the DNA and protein sequences of influenza viruses. Influenza virus DNA sequences are converted into CGR radian sequences based on CGR coordinates, and the long-memory ARFIMA model is introduced to analyze them. Ten H1N1 sequences and ten H3N2 sequences chosen at random from influenza A DNA sequences all show long-range correlation and are fitted well. The two kinds of sequences can also tentatively be identified by different ARFIMA models: H1N1 by ARFIMA(0,d,5) and H3N2 by ARFIMA(1,d,1). Influenza B and C DNA sequences are then analyzed: ten randomly chosen type B sequences and ten type C sequences likewise show long-range correlation and good fits, and these two kinds of sequences can also tentatively be identified by different ARFIMA models. As a classical time series model with well-developed algorithms, ARFIMA can help uncover unknown properties of influenza virus DNA sequences.

An ARIMA model is used to predict the bases of DNA sequences of the H1N1 subtype of influenza A, which is significant for H1N1 research. Forty-one H1N1 sequences with relatively high homology, dating from 1970 to 2010, are selected, and ARIMA(p,d,q) models are fitted and used for prediction at the first 20 positions. With very few exceptions, the original data fall within the forecast intervals, indicating that the models are reasonable and forecast well. On this basis, the same method is applied to the hemagglutinin amino acid sequences of the H1N1 subtype of influenza A, with similarly good forecasting results.
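The abstract describes converting a DNA sequence into a CGR (chaos game representation) radian sequence and then fitting a long-memory ARFIMA model, but it does not give the exact construction. The following is a minimal sketch under stated assumptions, not the thesis's own procedure: the corner assignment for A, C, G, T, the starting point (0.5, 0.5), and the use of `atan2(y, x)` as the "radian" value are all assumptions, and the function names are hypothetical. The last function lists the coefficients of the fractional differencing operator (1 - B)^d that underlies ARFIMA(p,d,q).

```python
import math

# Assumed corner assignment (a common CGR convention, not taken from the thesis).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return CGR coordinates: each base moves the point halfway to its corner."""
    x, y = 0.5, 0.5  # assumed starting point: center of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_radians(seq):
    """Map each CGR point to an angle in radians (assumed definition: atan2(y, x))."""
    return [math.atan2(y, x) for x, y in cgr_points(seq)]


def frac_diff_weights(d, n):
    """First n coefficients of (1 - B)^d, via w_k = w_{k-1} * (k - 1 - d) / k."""
    if n <= 0:
        return []
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights
```

For example, `cgr_radians("ATGC")` yields a numeric series with one value per base, which could then be passed to whatever ARFIMA fitting routine is available. For a fractional d between 0 and 0.5, the weights from `frac_diff_weights` decay slowly rather than cutting off, which is the long-memory behavior the abstract refers to.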
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year degree awarded]: 2011
[Classification number]: R346
