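A Study of the Relationship Between Prediction Accuracy and Datasets in Machine-Learning-Based Protein Interaction Prediction
Published: 2019-03-05 15:32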
[Abstract]: Machine learning studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so that performance improves continuously; it is a fundamental route to making computers intelligent. Machine learning is widely applied in data mining, computer vision, biometric recognition, search engines, medical diagnosis, and other fields. Proteins play an important role in the life activities of cells and are the final executors of cellular activity and function. Proteins carry out their functions through interactions with one another, and these interactions underlie the normal physiological function of all organisms. Because experimental methods for determining protein interactions have limitations, researchers in recent years have combined machine learning with biological information such as protein structure to predict protein-protein interactions, proposing many methods with differing prediction accuracy. We find that the accuracy reported for most of these methods is biased.
This thesis uses human and yeast protein interaction datasets together with several encoding methods to study how the accuracy of machine learning prediction of protein interactions relates to the repetition of samples in the dataset. The main contents are as follows.
Constructing positive and negative datasets is the basis for predicting protein interactions with machine learning. First, the adjacency matrix from graph theory and the maximum matching method are used to construct two classes of positive and negative datasets for human and for yeast, from which the datasets used for machine learning are built. Each dataset in the two classes has a different sample repetition rate, so that the relationship between prediction accuracy and sample repetition can be analyzed. The constructed data are then encoded with four methods (auto covariance, local descriptor, pseudo-amino acid composition, and conjoint triad), and two machine learning methods, k-nearest neighbor and random forest, are trained and used for prediction on the encoded data. Finally, the prediction results are analyzed in detail.
The experimental results show that, for each machine learning method and each of the four encoding methods, prediction accuracy differs when the repetition rate of protein samples in the positive and negative datasets differs: as the repetition rate goes from high to low, the prediction accuracy changes accordingly. We therefore conclude that sample repetition in the positive and negative datasets directly affects the prediction accuracy of machine learning methods and should be taken into account when analyzing their prediction results.
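To make the idea of sample repetition concrete, the following is a minimal sketch and not the thesis's actual procedure. The abstract does not define the repetition rate, so the definition used here (one minus the ratio of distinct proteins to protein slots) is an assumption. The function names and the toy protein identifiers are hypothetical, and a greedy maximal matching is used as a simpler stand-in for the maximum matching named in the abstract.

```python
def repetition_rate(pairs):
    """Assumed definition: 1 - (distinct proteins / total protein slots).

    A value of 0.0 means every protein appears exactly once in the dataset.
    """
    slots = [protein for pair in pairs for protein in pair]
    if not slots:
        return 0.0
    return 1.0 - len(set(slots)) / len(slots)


def greedy_matching(pairs):
    """Pick pairs in order so that no protein is used twice.

    This yields a maximal matching, which is not necessarily a maximum one.
    """
    used = set()
    selected = []
    for a, b in pairs:
        if a != b and a not in used and b not in used:
            selected.append((a, b))
            used.update((a, b))
    return selected


# Hypothetical toy interaction list; the identifiers are placeholders.
interactions = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P2", "P3"),
    ("P4", "P5"),
    ("P1", "P4"),
]

matched = greedy_matching(interactions)
print("full set:", repetition_rate(interactions))
print("matched subset:", matched, repetition_rate(matched))
```

Under this assumed definition, a dataset drawn from a matching has a repetition rate of zero because no protein is shared between pairs, whereas the full interaction list reuses proteins heavily. Datasets at different points between these two extremes are the kind of contrast the abstract describes.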
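[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: Q51;TP181
Article ID: 2435058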
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2435058.html