Research on Support Vector Machine-Based Methods for Predicting Lysine Post-Translational Modification Sites

Published: 2018-02-28 00:47

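Keywords: bioinformatics; protein post-translational modification; methylation; support vector machine; multi-label classification   Source: Dalian University of Technology, 2016 doctoral dissertation   Document type: degree thesis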

[Abstract]: Protein post-translational modification (PTM) is the covalent processing of a protein after translation, and it plays an important role in regulating protein conformational change, activity, and function. Accurately identifying PTM sites is key to understanding the molecular mechanisms of post-translational modification. Over the past decade, machine-learning-based prediction of protein PTM sites has developed rapidly and has become a research hotspot in bioinformatics. Based on the current state of PTM site prediction research, and working from protein sequence, this thesis uses the support vector machine (SVM) and improved variants of it to study several open problems in PTM site prediction. The main contributions are as follows.

1. A model, iLM-2L, is built to predict protein lysine methylation sites and methylation degree. It addresses two shortcomings of existing lysine methylation site predictors: low prediction accuracy and the inability to predict methylation degree. First, to address the low accuracy, the composition of k-spaced amino acid pairs encoding is applied to build the methylation site prediction model, which improves site prediction accuracy. Second, because existing methods ignore methylation degree, degree prediction is formulated as a multi-label learning problem and trained with a multi-label SVM algorithm. Simulation experiments show that iLM-2L outperforms five existing methylation site predictors: MeMo, MASA, BPB-PPMS, PMeS, and iMethyl-PseAAC. In addition, iLM-2L predicts methylation degree effectively, a function that existing predictors lack. Analysis of the optimal k-spaced amino acid pair composition features reveals potential sequence pattern preferences around lysine methylation sites. Finally, a methylation site prediction platform based on iLM-2L was built to provide an online prediction service for researchers (http://123.206.31. 171/iLM 2L/).

2. A model, IMP-PUP, is built to predict pupylation sites in prokaryotes. Because pupylation site data are scarce, existing prediction models perform poorly. To address this, a semi-supervised self-training SVM algorithm is proposed as the core classifier of IMP-PUP. The proposed algorithm mines the site information implicit in pupylated proteins in the PupDB database that carry no site annotations, which expands the modification site data available for training and thereby improves prediction performance. During iterative training, the algorithm uses a minimum-distance criterion to design a confidence function that selects reliable samples, overcoming the tendency of the original semi-supervised self-training SVM to misclassify samples early in training. Simulation results show that IMP-PUP outperforms three existing predictors: GPS-PUP, iPUP, and pbPUP. A corresponding online prediction platform based on IMP-PUP was built (http://123.206.31.171/IMP_PUP/).

3. A model, CKSAAP_PhoglySite, is built to predict lysine phosphoglycerylation sites. First, to address the imbalance between positive and negative training samples and the noise they contain, a fuzzy SVM algorithm is proposed. When designing the fuzzy membership function of each sample, the algorithm considers not only the distance from the sample to its class center but also how tightly packed the sample's neighborhood is, which greatly improves its ability to handle noisy data. By assigning a larger penalty factor to positive samples and a smaller one to negative samples, it also reduces the effect of data imbalance on the classifier. Second, to find an effective encoding for extracting sequence features around phosphoglycerylation sites, five commonly used features are analyzed and compared for their effect on prediction performance: amino acid composition, binary encoding, k-spaced amino acid pair composition, position-specific scoring matrix, and secondary structure. Finally, CKSAAP_PhoglySite is built with the proposed fuzzy SVM algorithm combined with k-spaced amino acid pair composition features. Jackknife test results show that the prediction accuracy of CKSAAP_PhoglySite is 14.2% higher than that of the existing prediction tool Phogly-PseAAC. A corresponding online prediction server based on CKSAAP_PhoglySite was built (http://123.206.31.171/CKSAAP_PhoglySite/).
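
The k-spaced amino acid pair composition encoding that the abstract names for both iLM-2L and CKSAAP_PhoglySite can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function name `cksaap`, the default `k_max = 4`, the decision to skip pairs containing gap or non-standard characters, and the normalization by the number of valid pairs are all hypothetical choices made here for the example.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order so feature positions are stable.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(window: str, k_max: int = 4) -> List[float]:
    """Encode one sequence window as k-spaced amino acid pair frequencies.

    For each gap k in 0..k_max, count every residue pair separated by
    exactly k positions, then divide by the number of valid pairs.
    The result has (k_max + 1) * 400 entries.
    """
    features: List[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(window) - k - 1):
            pair = window[i] + window[i + k + 1]
            # Assumption: pairs with gap or non-standard characters are ignored.
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features
```

In this sketch, a window centered on a candidate lysine would be passed to `cksaap`, and the resulting vector would serve as the input to an SVM classifier. The window length and the best range of `k` are not given in the abstract and would need to be taken from the thesis itself.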

[Degree-granting institution]: Dalian University of Technology
[Degree level]: Doctoral
[Year conferred]: 2016
[Classification number]: Q51;TP18

[Similar literature]