
Research on Protein Subnuclear Localization Based on Feature Fusion and Dimensionality Reduction Algorithms

Published: 2018-07-03 10:55

  Topics: protein subnuclear localization + fusion representation; Source: Yunnan University, master's thesis, 2016


[Abstract]: With the completion of human genome sequencing and the growing popularity of high-throughput sequencing, protein sequences are being produced in large numbers. Determining the function of newly sequenced proteins has become one of the central topics of bioinformatics. Proteins carry out their biological activities inside the cells of an organism, so their subcellular and subnuclear localization is closely tied to their function; subnuclear localization information also provides useful clues for the prevention, diagnosis, and treatment of genetic diseases, cancer, and related conditions. Obtaining this information through traditional biological experiments, however, costs a great deal of time and money. With the rapid development of computer science in recent years, studying protein subnuclear localization with machine learning has become a hot topic in bioinformatics, and the localization methods developed this way predict quickly and at relatively low cost. This thesis uses machine learning to study the protein subnuclear localization problem in depth. It first surveys the basic knowledge of protein subnuclear localization, the background and significance of the problem, and the current state of research, and describes the main research content of the field in detail. It then discusses protein sequence feature representation and classifier selection from several angles and summarizes the problems with current sequence representation methods. Finally, it identifies the points at which this thesis seeks to advance protein subnuclear localization.

The first contribution is a protein subnuclear localization method based on feature fusion and supervised locality preserving projection. Traditional feature representations extract protein features from only a single aspect of the sequence information, and classification models built on them are designed without analyzing the data distribution of the sequence representation, which leaves feature representation and classification relatively isolated from each other. The method therefore first fuses representations that carry complementary sequence information to obtain a fused feature representation with highly discriminative information. It then uses supervised locality preserving projection to learn the low-dimensional manifold of the data and reduce the dimensionality of the fused representation, yielding low-dimensional discriminative features that separate the classes while preserving within-class structure. Based on this data distribution, the K-nearest-neighbor classifier is chosen to predict the subnuclear location of a sequence. In a range of comparative experiments on two standard datasets, the method achieves relatively high prediction accuracy. Because it fully exploits the complementarity of the information contained in traditional sequence representations and takes into account the relationship between the data distribution of the representation and the classification model, the method substantially improves overall prediction accuracy. However, it ignores the differences among proteins at different subnuclear locations, which motivates the second contribution of the thesis.

The second contribution is a protein subnuclear localization method based on an efficient fused representation and linear discriminant analysis. Different feature representations contain different sequence information and therefore contribute to subnuclear localization to different degrees, and proteins at different subnuclear locations have different functions. By refining these differences for each subnuclear location, the method fuses the feature data of different subnuclear locations to different degrees, constructing two high-dimensional fused representations that contain highly discriminative information; a genetic algorithm is used to obtain the feature fusion coefficients for each subnuclear location. Because the resulting fused representations are high-dimensional and contain redundant information, linear discriminant analysis is used to reduce their dimensionality, the dimension that gives the highest subnuclear localization prediction accuracy is selected, and the protein subnuclear localization classifier of this chapter is developed. Extensive experiments on two standard datasets show that the proposed method achieves relatively high prediction accuracy and that the classifier also performs well.
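The fuse-then-classify idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption, not the thesis's implementation: the abstract does not name the sequence representations it fuses, so amino acid composition and dipeptide composition stand in for them; the fusion weights `w_aac` and `w_dipc` are fixed by hand rather than found by a genetic algorithm; the dimensionality reduction step (supervised locality preserving projection or linear discriminant analysis) is omitted entirely; and the sequences and labels in `TOY_TRAIN` are invented toy data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: 20 relative residue frequencies.

    Non-standard residues count toward the length but get no slot.
    """
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dipc(seq):
    """Dipeptide composition: 400 relative frequencies of adjacent pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_pairs = len(seq) - 1
    return [pairs[a + b] / n_pairs for a in AMINO_ACIDS for b in AMINO_ACIDS]


def fuse(seq, w_aac=1.0, w_dipc=1.0):
    """Weighted serial fusion: scale each representation, then concatenate.

    The weights are hypothetical hand-set values (assumption).
    """
    return [w_aac * x for x in aac(seq)] + [w_dipc * x for x in dipc(seq)]


def knn_predict(train, query, k=1):
    """Majority vote among the k training vectors nearest to the query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented toy sequences and labels, for illustration only.
TOY_TRAIN = [
    (fuse("KKKKRRKKKK"), "nucleolus"),
    (fuse("SSSSRSSSSS"), "nuclear speckle"),
]

prediction = knn_predict(TOY_TRAIN, fuse("KKKRKKKKSK"), k=1)
print(prediction)
```

In a fuller pipeline, a dimensionality reduction step would sit between `fuse` and `knn_predict`, and the fusion weights would be tuned against validation accuracy instead of being set by hand.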
[Degree-granting institution]: Yunnan University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: Q51; TP181


Article ID: 2093403


Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2093403.html

