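Machine-Learning-Based Prediction of PolyA Sites in Microsporidia
Topic: microsporidia. Approach: SVM. Source: Southwest University, master's thesis, 2017. Document type: degree thesis.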
[Abstract]: Bioinformatics emerged with the launch and progress of the Human Genome Project. The intersection of biology and information technology has advanced computer science and greatly promoted applied research in biology. The State Key Laboratory of Silkworm Genome Biology at Southwest University is a leading silkworm research laboratory in China. Its current research covers the silkworm genome and functional genomics, silkworm genetic resources and modern sericulture technology, and sericultural pathogenic microorganisms and the use of microbial resources. Silkworm pathogens infect silkworms and impair their growth and development, causing considerable losses to sericulture, so they have attracted a growing number of researchers. Organisms change continually and genomic information varies widely. Many machine learning algorithms have been applied to gene prediction in human and rice, yet computational studies of microsporidia, a pathogen that infects the silkworm, remain scarce. This thesis uses machine learning algorithms to predict Poly A sites in microsporidia and studies the problem in depth. Compared with experimental biological methods, this approach improves efficiency and offers a useful line of attack for microsporidian research. Machine learning uses computation to improve a system's performance from experience. As new techniques and methods appear in computer science, they are increasingly applied in bioinformatics, and in gene prediction in particular. Polyadenylation is an important step in forming mature mRNA in eukaryotic cells, and predicting its sites matters for discovering coding genes in genomic sequences.
After in-depth discussion with the silkworm microsporidia research group, this thesis takes as its subject the microsporidian Encephalitozoon cuniculi, a silkworm pathogen that lacks effective gene prediction methods. Features are extracted from Encephalitozoon cuniculi gene sequences using the Z curve, the position-specific scoring matrix (PSSM), and k-th order nucleotide frequencies. The extracted k-th order nucleotide frequency features are then combined, and the best combination is chosen by comparing experimental results. That combination, together with the PSSM and Z-curve features, forms the final input features. PCA is applied to these features to reduce the dimensionality of the feature space and hence the algorithmic complexity. Finally, different classifiers are trained on the resulting features to predict microsporidian PolyA sites. The method can select the best k-th order nucleotide frequency features according to the expression preferences of microsporidian gene sequences, which contributes to the final PolyA site features and thereby affects the classification results. Choosing a suitable feature extraction method is critical to the subsequent classification and to the accuracy of the prediction algorithm. Support vector machines are widely used across fields, with many results in text classification, license plate recognition, and image retrieval. This thesis predicts microsporidian PolyA sites with support vector machines, neural networks, and the KNN algorithm, and the experiments show that the support vector machine classifies comparatively well.
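The sequence features named above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the choice of k = 1, 2, 3, the use of whole-window base frequencies for the Z-curve components, and the example sequence are all assumptions made for illustration. The PSSM features and the PCA step are omitted.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence


def kmer_frequencies(seq: str, k: int) -> List[float]:
    """Relative frequency of every DNA k-mer, in lexicographic (ACGT) order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def z_curve(seq: str) -> List[float]:
    """Three Z-curve components computed from the window's base frequencies."""
    seq = seq.upper()
    n = sum(seq.count(b) for b in "ACGT")
    if n == 0:
        return [0.0, 0.0, 0.0]
    a, c, g, t = (seq.count(b) / n for b in "ACGT")
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return [x, y, z]


def feature_vector(seq: str, ks: Sequence[int] = (1, 2, 3)) -> List[float]:
    """Concatenate the Z-curve components with k-mer frequencies for each k."""
    features = z_curve(seq)
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features


window = "AATAAAGCTTTTGCATTTTATAAAC"  # made-up sequence, not thesis data
vec = feature_vector(window)
print(len(vec))  # 3 Z-curve values + 4 + 16 + 64 k-mer frequencies
```

Comparing different combinations of k, as the abstract describes, would amount to calling `feature_vector` with different `ks` tuples and evaluating a classifier on each.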
The kernel function is an important factor in support vector machine classification. Conditionally positive definite kernels are already widely used in text classification and face recognition, and in this thesis's experiments the polynomial kernel classified comparatively well. On that basis, the polynomial kernel and a conditionally positive definite kernel are linearly combined into a new kernel function, which is applied to microsporidian PolyA site prediction. The experiments show that with this mixed kernel as the SVM kernel, and after tuning the model parameters, classification performance improves substantially. The work facilitates future biological research on microsporidia and provides a basis for effective prevention and control of silkworm diseases and pests, giving it important application value.
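A mixed kernel of this kind can be sketched in Python as follows. The abstract does not say which conditionally positive definite kernel or which combination weights were used, so the power kernel -||x - y||^beta (conditionally positive definite for 0 < beta <= 2), the single `weight` parameter, and all function names below are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def polynomial_kernel(x: Vector, y: Vector, degree: int = 3,
                      gamma: float = 1.0, coef0: float = 1.0) -> float:
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def cpd_power_kernel(x: Vector, y: Vector, beta: float = 1.0) -> float:
    """Power kernel -||x - y|| ** beta (assumed choice of CPD kernel)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return -(distance ** beta)


def mixed_kernel(x: Vector, y: Vector, weight: float = 0.5,
                 beta: float = 1.0, **poly_params: float) -> float:
    """weight * polynomial kernel + (1 - weight) * CPD power kernel."""
    return (weight * polynomial_kernel(x, y, **poly_params)
            + (1.0 - weight) * cpd_power_kernel(x, y, beta=beta))


def gram_matrix(rows_a: Sequence[Vector], rows_b: Sequence[Vector],
                kernel: Callable[[Vector, Vector], float]) -> List[List[float]]:
    """Kernel value for every pair (a, b) with a in rows_a and b in rows_b."""
    return [[kernel(a, b) for b in rows_b] for a in rows_a]


samples = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]  # toy vectors, not thesis data
K = gram_matrix(samples, samples, mixed_kernel)
print(K[0][1])  # 0.5 * (0 + 1)**3 + 0.5 * (-5.0)
```

A Gram matrix built this way could be handed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. The weight, `beta`, and the polynomial parameters would then be tuned by cross-validation, in the spirit of the parameter adjustment the abstract mentions.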
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4; TP181
Article ID: 1639675
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/1639675.html