Research on Protein Subcellular Localization Prediction Based on a Similarity-Alignment-Improved KNN
Published: 2019-05-07 09:03
[Abstract]: A protein's function is closely tied to the subcellular compartment in which it resides, so predicting the subcellular location of proteins helps us understand their function and is of great significance to biological research. Obtaining subcellular location information experimentally is slow and costly, and it does not scale to large numbers of protein sequences, so an efficient prediction method is needed. This thesis introduces feature extraction algorithms for protein sequences and improves the traditional k-nearest-neighbor (KNN) classifier, proposing a protein subcellular classification prediction algorithm based on a similarity-alignment-improved KNN; ensemble prediction with AdaBoost and Bagging achieves good experimental results. The main work is as follows. The thesis introduces three feature extraction algorithms: amino acid composition, dipeptide composition, and pseudo amino acid composition. In addition to the public datasets ZD98 and CH317, a new dataset, Gram1253, was constructed. The traditional KNN classifier is improved by using BLAST alignment to find the most similar sequences for the KNN decision, yielding a new classification algorithm, the similarity-alignment KNN prediction algorithm. Under the Jackknife test on the three datasets, its success rates are 93.9%, 91.5%, and 92.5%, respectively. The Hadoop distributed computing framework is then introduced to optimize the algorithm. To study the prediction algorithm further, AdaBoost and Bagging are used to combine multiple similarity-alignment KNN classifiers into ensembles that predict the subcellular location of protein sequences. Under the Jackknife test on the three datasets, AdaBoost achieves success rates of 94.9%, 92.4%, and 93.1%, respectively. Because the class distributions of ZD98 and CH317 are imbalanced, the Bagging ensemble's accuracy on them, 89.8% and 87.7%, is lower than that of the similarity-alignment KNN algorithm. On Gram1253, however, Bagging performs well, reaching an accuracy of 92.9%. The results indicate that AdaBoost and Bagging ensemble classification is a fairly effective approach to protein subcellular location prediction.
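As a rough illustration of two ideas named in the abstract, the sketch below computes amino acid composition and dipeptide composition features and runs a Jackknife (leave-one-out) test with a nearest-neighbor vote. It is not the thesis's implementation: all function names are hypothetical, and `composition_similarity` is only a stand-in for the BLAST alignment score that the thesis uses to pick the most similar sequences.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue frequencies."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Skip pairs containing non-standard residue codes.
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(valid)
    return [counts[d] / len(valid) for d in DIPEPTIDES]


def composition_similarity(a, b):
    """Assumed stand-in for a BLAST score: negative squared Euclidean
    distance between amino acid composition vectors (higher = more similar)."""
    va, vb = amino_acid_composition(a), amino_acid_composition(b)
    return -sum((x - y) ** 2 for x, y in zip(va, vb))


def jackknife_accuracy(sequences, labels, similarity, k=1):
    """Leave-one-out test: predict each sequence from all the others by
    majority vote among its k most similar sequences."""
    correct = 0
    for i, query in enumerate(sequences):
        scored = [
            (similarity(query, other), labels[j])
            for j, other in enumerate(sequences)
            if j != i
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        votes = Counter(label for _, label in scored[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(sequences)


if __name__ == "__main__":
    # Toy sequences and labels, invented purely for illustration.
    toy_sequences = ["AAAA", "AAAC", "WWWW", "WWWY"]
    toy_labels = ["x", "x", "y", "y"]
    print(jackknife_accuracy(toy_sequences, toy_labels, composition_similarity))
```

Passing the similarity function as a parameter keeps the Jackknife loop independent of how similarity is measured, so a real alignment score could be substituted without changing the evaluation code.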
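[Degree-granting institution]: Nanjing Agricultural University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: Q51; TP301.6
Article ID: 2470952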
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2470952.html