Data Mining and Feature Selection for High-Dimensional Biomedical Data Based on the TCGA and PubMed Databases
[Abstract]: With the rapid development of life-science technology, and of sequencing technology in particular, the volume of biomedical data has grown dramatically. Biomedical data are not only large but also high-dimensional: the number of features far exceeds the number of observations (the sample size). These data therefore bring researchers new challenges as well as new opportunities, and mining the relationships hidden in massive data has become a focus of research. Feature selection means choosing a subset of the original features to represent the original data; a well-designed feature selection method lets the selected features serve subsequent data mining. It is no exaggeration to say that feature selection is to data mining what panning is to gold in sand: almost no complete data mining effort can avoid this step. Taking feature selection as its entry point, this thesis therefore explores bioinformatics methods for high-dimensional biomedical data through two important biomedical questions. We propose different features and strategies at multiple levels and examine their descriptive and predictive power on these practical questions. The feature selection methods and results developed here can serve as a reference for processing and analyzing high-dimensional biomedical data. Feature selection arises mainly in machine learning and statistics, where it refers to selecting the most relevant variables from a large number of variables for model construction. It has three main advantages: it simplifies the model and makes it easier to understand, it shortens training time, and it improves generalization by reducing overfitting.
In practice, most variables in a variable set carry redundant information, and removing them causes no loss of information. Feature selection is therefore an indispensable step in handling massive high-dimensional biomedical data. As the 14th-century philosopher William of Ockham put it in the principle known as Occam's razor: entities should not be multiplied beyond necessity. Feature screening and model simplification can thus be called the soul of large-scale data processing, and this is the starting point of this thesis. At present there are two main kinds of feature selection methods: one uses the topological structure of the data themselves to screen statistical signals, and the other introduces external knowledge, such as background knowledge from a specific field. Using data from The Cancer Genome Atlas (TCGA) database, this thesis applies both kinds of methods to predict tumor prognosis. First, with respect to the topological structure of the data, we focus on screening and discovering gene and small-RNA diagnostic markers of hepatocellular carcinoma (HCC). In a network, a node with a relatively high degree is called a hub. By combining survival analysis with a study of the topological properties of survival-related molecules, we found that hub nodes are more enriched in genes associated with HCC prognosis, indicating that hub nodes in complex molecular networks are more likely to be potential features that determine HCC prognosis, i.e. molecular markers. Second, with respect to domain knowledge, we focus on predicting drug response after chemotherapy in multiple tumor types. The main cause of chemotherapy failure is multidrug resistance (MDR).
Drug resistance is a relatively complex process. It usually arises when the protein encoded by a drug-resistance gene is overexpressed and chemotherapeutic agents are pumped out of the cell by an energy-dependent efflux pump, which reduces their accumulation inside the cell and leads to resistance. We therefore treat gene mutation as the exposure and tumor drug resistance as the outcome, and screen genes by combining the relative risk (RR) with its statistical significance (P-value), obtaining drug-resistance-related mutated genes for eight tumor types as the feature set of the prognosis prediction model. With this feature set, we applied three machine learning methods to predict drug resistance in samples of the eight tumor types. In head and neck squamous cell carcinoma (HNSC) in particular, the area under the ROC curve (AUC) reached 0.980, indicating that a model built on features derived from domain knowledge can distinguish drug-resistant from drug-sensitive patients after drug intervention and can provide an important reference for choosing an appropriate treatment. Besides drug intervention, a growing number of studies show that dietary intervention is also an important means of regulating human health. In addition to tumor prognosis, we therefore also try to predict potentially health-beneficial carbohydrates, known as prebiotics, from large-scale text data in the PubMed database. We retrieved the PubMed literature on 15 known prebiotics, extracted features from it, used this feature set to model and score candidate carbohydrates, and produced a list of potential prebiotics. This mining approach can serve as a reference for other data mining researchers and provides a candidate list for researchers studying prebiotics.
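The mutation screening described above (mutation as exposure, drug resistance as outcome, filtered by RR and P-value) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not name the significance test or the cutoffs, so the two-sided Fisher exact test, the thresholds `rr_min` and `alpha`, the function names, and the toy counts for the hypothetical genes `GENE_A` and `GENE_B` are all assumptions made for illustration.

```python
from math import comb


def relative_risk(a, b, c, d):
    """Risk of resistance in mutated samples divided by risk in wild-type samples.

    a: mutated & resistant,   b: mutated & sensitive,
    c: wild-type & resistant, d: wild-type & sensitive.
    """
    if a + b == 0 or c + d == 0:
        return float("nan")  # one exposure group is empty, RR undefined
    if c == 0:
        return float("inf") if a > 0 else float("nan")
    return (a / (a + b)) / (c / (c + d))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for a 2x2 table (assumed test, see lead-in)."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x resistant samples among the mutated ones.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def screen_genes(tables, rr_min=1.0, alpha=0.05):
    """Keep genes whose mutation raises resistance risk (RR > rr_min) with P < alpha."""
    selected = []
    for gene, (a, b, c, d) in tables.items():
        rr = relative_risk(a, b, c, d)
        p = fisher_exact_two_sided(a, b, c, d)
        if rr > rr_min and p < alpha:
            selected.append(gene)
    return selected


# Toy counts (a, b, c, d) for two hypothetical genes; not data from the thesis.
toy_tables = {
    "GENE_A": (8, 2, 2, 8),
    "GENE_B": (5, 5, 5, 5),
}
selected_genes = screen_genes(toy_tables)
print(selected_genes)
```

On these toy counts only `GENE_A` passes both filters (RR = 4, Fisher P-value of about 0.023). The genes that survive such a screen would then form the feature set passed to a classifier.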
As large-scale biomedical data become openly available, data mining is becoming increasingly important. Data mining helps us understand life at the system level and is an important approach in life-science research, and feature selection is its soul. On this basis, future work will consider text data and biological expression data together, in an attempt to contribute to improving human health.
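As an illustration of the network-topology strategy in the abstract, where a node with a relatively high degree is called a hub, the following Python sketch ranks the nodes of an undirected network by degree. The abstract does not say how "relatively high" was defined, so the `top_fraction` cutoff, the function names, and the toy edge list with placeholder node names are assumptions, not the thesis's method.

```python
from collections import defaultdict


def node_degrees(edges):
    """Degree of each node in an undirected network given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Using sets means duplicate edges are counted only once.
    return {node: len(nbrs) for node, nbrs in neighbors.items()}


def hub_nodes(edges, top_fraction=0.1):
    """Return the top_fraction of nodes by degree (at least one if any exist)."""
    degrees = node_degrees(edges)
    # Sort by descending degree, breaking ties by node name for reproducibility.
    ranked = sorted(degrees, key=lambda node: (-degrees[node], node))
    k = max(1, round(len(ranked) * top_fraction))
    return ranked[:k]


# Toy molecular network with placeholder node names; not data from the thesis.
toy_edges = [
    ("g1", "g2"), ("g1", "g3"), ("g1", "g4"),
    ("g1", "g5"), ("g2", "g3"), ("g5", "g6"),
]
hubs = hub_nodes(toy_edges, top_fraction=0.2)
print(hubs)
```

In a workflow like the one the abstract outlines, the resulting hub list would then be compared against survival-associated genes to check for enrichment.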
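[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army (中国人民解放军军事医学科学院)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; R318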
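[Citation]: 单光宇 (Shan Guangyu); Data Mining and Feature Selection for High-Dimensional Biomedical Data Based on the TCGA and PubMed Databases [D]; Academy of Military Medical Sciences of the Chinese People's Liberation Army; 2017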
Article ID: 2252235
Article link: https://www.wllwen.com/yixuelunwen/swyx/2252235.html