Application of a Local-Sample-Based Feature Selection Algorithm to Transcriptome Data

Published: 2018-01-05 01:32

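Keywords: Application of a Local-Sample-Based Feature Selection Algorithm to Transcriptome Data　Source: Jilin University, 2017 master's thesis　Document type: degree thesis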

[Abstract]: With the development of gene chips and next-generation sequencing, researchers have obtained large amounts of gene transcriptome data through experiments, but the small sample sizes and high dimensionality of these data seriously reduce the efficiency of extracting effective features. Developing an efficient and robust feature selection method that extracts effective features from high-dimensional transcriptome data is therefore very important for researchers. In recent years, researchers have begun to apply feature selection algorithms to such high-dimensional, small-sample data. As this work has deepened, they have found that feature selection models trained on all samples do not achieve optimal results: noisy samples, sample outliers, imbalanced sample distributions, and similar factors can lower classification accuracy. The study of local samples is therefore particularly important. Cancer is a heterogeneous disease, and during its development and evolution, patients with the same genetic characteristics may share the same molecular mechanisms. Using local samples with the same genetic characteristics to obtain a more accurate classification model thus becomes increasingly important, since a more accurate model can better predict whether a person has cancer. For this reason, this thesis takes feature selection on cancer transcriptome data as its research subject.
In this thesis, we propose a novel and effective feature selection method based on local samples, which obtains more accurate features and thereby better performance. Local samples are obtained in three steps. First, the Euclidean distance between every pair of samples is computed. Second, for each center sample, its several nearest neighbor samples are selected to build a co-expression network, and random walk with restart is applied to form the final steady-state probability network; the steady-state probabilities can be regarded as similarities between samples, which yields the sample similarity network. Finally, to select better local samples, the sample similarity network is partitioned by setting a fixed range, and after comparing five sample selection strategies, the set of local samples with the best classification performance is obtained.
We used transcriptome data for six cancers (breast, gastric, pancreatic, lung, thyroid, and prostate cancer) as test datasets, applied the local-sample-based feature selection method to each, evaluated classification performance with leave-one-out cross-validation, and compared the results with the t-test, the rank-sum test, maximum relevance minimum redundancy, and other methods. The experimental results show that the maximum classification accuracies of the proposed method on the six datasets are 98.51%, 97.27%, 98.55%, 100%, 100%, and 100%, respectively, with very good results on most datasets. Our method can therefore extract useful features from data on different cancers and use them to classify cancer; it has good applicability and also provides a reference for medical researchers.
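The three steps for obtaining local samples can be illustrated with a short sketch. This is not the thesis's implementation: the abstract does not give the neighbor count, the restart probability, the edge weighting, or the exact partitioning rule, so `k`, `restart`, the unweighted 0/1 edges, the `[low, high]` range, and all function names below are assumptions chosen for illustration.

```python
import numpy as np


def pairwise_euclidean(X):
    """Euclidean distance between every pair of rows (samples) of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative rounding errors


def knn_adjacency(D, k):
    """Symmetric 0/1 adjacency linking each sample to its k nearest neighbors.

    Assumption: unweighted edges; the thesis may weight edges differently.
    """
    n = D.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:k]
        A[i, neighbors] = 1.0
    return np.maximum(A, A.T)


def rwr_similarity(A, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state random-walk-with-restart probabilities.

    Column c holds the steady-state distribution of a walker that
    restarts at sample c; it is used here as similarity to sample c.
    """
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0       # guard against isolated nodes
    W = A / col_sums                    # column-stochastic transition matrix
    E = np.eye(A.shape[0])              # one restart vector per sample
    P = E.copy()
    for _ in range(max_iter):
        P_next = (1.0 - restart) * (W @ P) + restart * E
        converged = np.abs(P_next - P).max() < tol
        P = P_next
        if converged:
            break
    return P


def local_samples(P, center, low, high):
    """Samples whose similarity to `center` lies in the assumed range [low, high]."""
    scores = P[:, center]
    idx = np.where((scores >= low) & (scores <= high))[0]
    return idx[idx != center]
```

A call such as `local_samples(rwr_similarity(knn_adjacency(pairwise_euclidean(X), k=5)), center=0, low=0.01, high=1.0)` would return candidate local samples for sample 0, where `X` is a samples-by-genes expression matrix. Feature selection and leave-one-out evaluation would then run on that subset; those stages are not shown here.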

[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4;TP181

[Similar literature]