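Research on Multi-Site Subcellular Localization Prediction Based on Sequence Features
Published: 2018-05-31 15:05
Topics: subcellular localization + multi-label learning; Source: master's thesis, Northeast Normal University (东北师范大学), 2017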
[Abstract]: A protein's function is closely tied to its location in the cell: a newly synthesized protein must be transported to a specific organelle (that is, a subcellular compartment) before it can function correctly. Predicting protein subcellular localization is therefore extremely important for determining the function of an unknown protein, for understanding protein interactions and, through them, various biological processes, and for studying the pathogenesis of certain diseases. Traditional experimental techniques such as subcellular fractionation, green fluorescent protein fusion, mass spectrometry, and isotope affinity tags can provide fairly accurate localization data, but most of these experiments are expensive and time-consuming, so relying on them alone is usually costly. In recent years, as biological data have grown enormously, bioinformatics has developed rapidly as an interdisciplinary field, and more and more researchers use computational techniques to help solve prominent biological problems. Predicting protein subcellular localization with machine learning is one such problem and is the main research goal of this thesis.

After years of effort, research on machine-learning-assisted subcellular localization prediction has produced a series of meaningful results: computational methods have appeared one after another, prediction accuracy has steadily improved, and prediction platforms have emerged, all of which provide valuable information for subsequent protein function analysis. Despite this progress, three areas still need improvement: (1) most existing methods apply only to binary-classification data, whereas many proteins may have one or more subcellular locations, so what is needed is a classifier capable of multi-label subcellular localization prediction; (2) although some methods have introduced multi-label learning to identify proteins with one or more subcellular locations, their datasets contain too few multi-label proteins; (3) some predictors use Gene Ontology (GO) to improve accuracy, but the resulting features have very high dimensionality and are tedious to extract, so an effective dimensionality reduction method is required.

Building on a thorough comparative study of current protein subcellular localization prediction algorithms, this thesis proposes improvements that address the shortcomings of existing classifiers, and describes four aspects in detail: dataset acquisition, protein sequence feature extraction, the prediction algorithm, and performance evaluation. The dataset comes from the widely recognized tool iLoc-Animal; its label "diversity" reaches 1.8922, and 20 classes are predicted in total. Sequence features are extracted using amino acid composition (AAC) and the clustering-based feature LIFT, which avoids the tedious and time-consuming construction of features from GO. After comparing commonly used multi-label prediction algorithms and strategies, the multi-label K-nearest neighbor algorithm was adopted. In the testing stage, ten-fold cross-validation was used to evaluate five metrics, namely precision, accuracy, recall, absolute-true rate, and absolute-false rate, and the results were compared with the classic algorithm iLoc-Animal.

The experimental results show that the proposed method achieves an accuracy of 74.35% and an absolute-true rate of 71.17%, clearly higher than the accuracy (62.28%) and absolute-true rate (45.62%) of iLoc-Animal; its results on the other evaluation metrics are also better than those of iLoc-Animal. Besides its higher prediction accuracy, the method is simple to implement and responds quickly. It is hoped that this work will inform and advance current research on protein subcellular localization prediction.
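The abstract names amino acid composition (AAC) as a sequence feature and five multi-label evaluation metrics, but does not give their formulas. The sketch below is a minimal Python illustration, not the thesis's implementation. It assumes AAC is the relative frequency of each of the 20 standard amino acids, and it assumes the set-based metric definitions commonly used in multi-label localization studies (per-sample intersection and union of true and predicted label sets). The function names `aac` and `multilabel_metrics` and the parameter `n_labels` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a sequence.

    Assumption: non-standard residues (e.g. X, B, Z) are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def multilabel_metrics(true_sets, pred_sets, n_labels):
    """Average five set-based multi-label metrics over all samples.

    Assumed definitions, for true set Y and predicted set P of a sample:
      precision      = |Y & P| / |P|
      recall         = |Y & P| / |Y|
      accuracy       = |Y & P| / |Y | P|
      absolute_true  = 1 if Y == P else 0
      absolute_false = (|Y | P| - |Y & P|) / n_labels
    """
    n = len(true_sets)
    sums = {"precision": 0.0, "recall": 0.0, "accuracy": 0.0,
            "absolute_true": 0.0, "absolute_false": 0.0}
    for y, p in zip(true_sets, pred_sets):
        y, p = set(y), set(p)
        inter = len(y & p)
        union = len(y | p)
        # Empty sets score 0 for precision/recall to avoid division by zero.
        sums["precision"] += inter / len(p) if p else 0.0
        sums["recall"] += inter / len(y) if y else 0.0
        # Two empty sets agree exactly, so accuracy is taken as 1.
        sums["accuracy"] += inter / union if union else 1.0
        sums["absolute_true"] += 1.0 if y == p else 0.0
        sums["absolute_false"] += (union - inter) / n_labels
    return {name: value / n for name, value in sums.items()}
```

In a ten-fold cross-validation setup, `multilabel_metrics` would be called on the held-out predictions of each fold (or on the pooled predictions of all folds), with `n_labels` set to the total number of subcellular locations, which is 20 for the dataset described here.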
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q26;TP181
Article ID: 1960203
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/1960203.html