Verifying the Independent Evolution Law and Biological Functions of 8-mers in the Yeast Genome Sequence

Published: 2018-08-02 17:39
[Abstract]: The usage of k-mers in whole-genome sequences is nonrandom, and different kinds of k-mers have different biological functions. Uncovering the rules of k-mer usage and the biological functions of k-mers is important for understanding the evolution of genome structure and for a systematic understanding of functional segments. Studies of the k-mer spectra of more than a hundred species have found that the k-mer spectra of tetrapods are multimodal, while those of other organisms are unimodal. Opinions differ on why multimodal k-mer spectra arise: some studies point to different types of functional or structural elements as the main cause, others hold that multimodal spectra are characterized by G+C content and CpG suppression, and still others argue that the multiple peaks are formed by two classes of rare k-mers. The origin of genomic k-mer spectra therefore remains an open question. Using statistical analysis and bioinformatics methods, and drawing on the distribution pattern of the human k-mer spectrum, this thesis studies the k-mer spectrum of the yeast genome sequence, explores the independent evolution mechanism of the CG-class 8-mer subsets, and proposes and verifies theoretical conjectures about the biological functions of CG-class motifs. The main research contents are as follows.

(1) The distribution of the relative motif number of 8-mers over frequency (the 8-mer spectrum, for short) was computed for the human chromosome 1 sequence, and the 8-mer spectrum was found to be trimodal. When all 8-mers were divided into three subsets under each of the 16 XY dinucleotide classifications, only under the CG classification did the three subsets, CG0 (8-mers containing no CG dinucleotide), CG_1 (8-mers containing one CG) and CG_2 (8-mers containing two or more CGs), each form an independent unimodal distribution. This is termed the independent evolution law of CG-class motifs. The positions of the three CG subset distributions correspond strictly to the three peaks of the overall 8-mer distribution. It follows that the distance between the three CG subset distributions directly determines whether the overall spectrum is unimodal or multimodal. Comparison with the 8-mer spectrum of random sequences shows that the spectrum of CG0 motifs lies near the random center, whereas the spectra of CG_1 and CG_2 motifs lie far from it, indicating that 8-mers containing CG dinucleotides evolve directionally while 8-mers without CG dinucleotides evolve randomly. The three CG subset distributions have two features: (i) the most probable frequencies of the CG_2 and CG_1 distributions are clearly lower than that of the CG0 distribution; (ii) the CG_2 and CG_1 distributions are clearly narrower than the CG0 distribution. These two features indicate that 8-mer usage in the CG_2 and CG_1 subsets is conserved. After analyzing the sequence features of the three CG subsets, nucleosome center sequences (NCSs) and CpG islands (CGIs), two theoretical conjectures are proposed: CG_1 motifs are nucleosome-binding motifs, and CG_2 motifs are the motif units of CGIs.

(2) The 8-mer spectrum of the yeast genome sequence is unimodal. Computing the distribution of the relative motif number of 8-mers over frequency under the 16 dinucleotide classifications in yeast shows that only the CG subset distributions have the two features of the human CG subset distributions. This indicates that 8-mer usage in the CG_2 and CG_1 subsets is also conserved in yeast, and that the unimodal distribution in yeast results from the superposition of three CG subset distributions that lie too close together. The conclusion is that the independent evolution law of CG motif usage was already in place in yeast, the simplest eukaryote. Because the CG subsets contain a large number of motifs, the frequencies of m-mers (m = 2, 3, 4) within the three CG subsets are used to characterize the motif information of each subset. The analysis first shows that the motif information of the three CG subsets deviates from that of the overall 8-mers to different degrees. The total deviation in m-mer usage (new symmetric relative entropy, NSRE) of the yeast genome sequence was then examined under the 16 XY classifications, and the deviation in motif usage was found to be largest under the CG classification. This leads to the conclusion that the CG dinucleotide is the "core" of the emergence and evolution of functional elements as genomes evolve from simple to complex.

(3) To verify whether CG_1 motifs are nucleosome-binding motifs, the motif information of the CG0, CG_1 and CG_2 subsets was mapped onto yeast nucleosome center sequences and linker sequences for binary classification assessment. The average area under the ROC curve (AUC) obtained from CG_1 motif information was the largest, showing that CG_1 motifs favor nucleosome center sequences more than CG0 and CG_2 motifs do. The NSRE distribution along nucleosome center sequences was then obtained from CG_1 subset motif information; this distribution agrees with published results. The results show that abundant motifs determine the basic framework of the nucleosome and rare motifs determine its fine structure. When the canonical histone octamer is unrolled into a one-dimensional arrangement along the DNA duplex, the maxima of the NSRE distribution show an excellent one-to-one correspondence with the positions of the eight histones. Together, these two results verify the conjecture that CG_1 motifs are nucleosome-binding motifs.

(4) Statistical analysis of single-base-resolution nucleosome position data revealed that some nucleosomes are in a squeezed state. According to the position of squeezing, nucleosomes were divided into four classes: standard nucleosomes, upstream-squeezed nucleosomes, downstream-squeezed nucleosomes, and nucleosomes squeezed at both ends. Building on the conclusion that CG_1 motifs are nucleosome-binding motifs, the NSRE distribution along the center sequences of the four classes was analyzed. Squeezed nucleosomes were found to vary with the sequence structure of the squeezed and unsqueezed ends, and the sequence in squeezed regions of a nucleosome was more strongly organized. Nucleosome linker sequences were then grouped by increasing length into 11 length groups, and the MEME online software was used to search for conserved motifs in these groups. Four classes of conserved motifs were found, pointing to the diversity of linker sequences.

(5) To verify whether CG_2 motifs are the motif units of CGIs, the motif information of CG_2, CG_1 and CG0 was mapped onto yeast CGIs and the corresponding non-CpG-island sequences for ROC analysis. The average AUC values were 0.95, 0.80 and 0.02, respectively, showing that CG_2 motif information agrees closely with the compositional information of CGIs. The optimal threshold was chosen on the ROC curve, and the total accuracy (AAC) and correlation coefficient (MCC) were computed at that threshold. These results further confirm that CG_2 motif information can characterize CGI sequences, verifying that CG_2 motifs are the structural units of CGIs.
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: Q78




Article ID: 2160180


Article link: https://www.wllwen.com/shoufeilunwen/jckxbs/2160180.html

