
Statistical Learning Models for Analyzing the Effect of Protein Expression on Breast Cancer Cell Proliferation

Published: 2018-09-04 09:40
[Abstract]: As people come into contact with harmful substances more and more frequently in daily life, the incidence of cancer is gradually rising. In the era of big data, selecting the useful part of complex, intertwined data has become very important, and because statistical learning methods are well suited to extracting useful information, they have become an important research topic. This thesis studies reverse phase protein array (RPPA) data and cell proliferation data measured at MD Anderson on the breast cancer cell line MDA-MB-231. These data are used to train linear regression, support vector machine (SVM), and random forest (RF) models in order to find the key proteins that control breast cancer cell proliferation, which are then treated as potential targets for cancer drugs. The data fluctuate considerably, so to reduce their impact on statistical power the RPPA data are preprocessed first. The preprocessed RPPA data then serve as the input and cell proliferation as the output for training the linear regression, SVM, and RF models; for the linear regression model, a method that combines principal component analysis (PCA) with linear regression is proposed and used. Finally, the results of the three models are compared to identify a model that is both accurate and able to select the protein combinations with key influence. The results show that the linear regression model is highly accurate, that the SVM model can select the protein combinations that play a key role in breast cancer cell proliferation, and that RF performs very well in both respects. Finally, RF is used to analyze the RPPA data, yielding 28 proteins with a relatively large effect on breast cancer cells; a literature search confirms that 21 of them strongly influence breast cancer cell proliferation.
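The three-model comparison described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual pipeline: the data are synthetic, `StandardScaler` stands in for the unspecified preprocessing step, and the number of principal components, the SVM kernel, and the forest size are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the RPPA matrix (samples x proteins); NOT the thesis data.
rng = np.random.default_rng(0)
n_samples, n_proteins = 200, 20
X = rng.normal(size=(n_samples, n_proteins))
# Assumption for illustration: proliferation depends only on proteins 0 and 1.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

models = {
    # PCA followed by linear regression (principal component regression);
    # n_components=10 is an arbitrary choice for this sketch.
    "pca_linear": make_pipeline(
        StandardScaler(), PCA(n_components=10), LinearRegression()
    ),
    # Support vector regression with an RBF kernel (kernel choice is assumed).
    "svm": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    # Random forest regressor; tree count and seed are assumed.
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Mean 5-fold cross-validated R^2 per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

# Rank proteins by random forest feature importance (highest first).
rf = models["random_forest"].fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
top_proteins = ranking[:2].tolist()

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print("Top-ranked protein indices:", top_proteins)
```

On real RPPA data, the columns of `X` would be protein expression levels and the top of `ranking` would give candidate key proteins, which would still need to be checked against the literature as the abstract describes.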
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: R737.9;Q811.4




Article ID: 2221705


Article link: https://www.wllwen.com/yixuelunwen/swyx/2221705.html

