Research on siRNA Feature Selection Based on the Relief Algorithm

Published: 2018-03-23 01:32

Topic: siRNA  Focus: siRNA interference efficiency  Source: Jilin University, 2017 master's thesis  Document type: degree thesis

[Abstract]: RNA interference (ribonucleic acid interference) is a biotechnology that silences the expression of a target gene by introducing double-stranded RNA into an organism. Designing siRNAs with a high inhibition rate is an important prerequisite for RNA interference. Designing efficient siRNAs purely through biological experiments is expensive, slow, and inefficient, so using computational methods to optimize the design of high-inhibition siRNAs beforehand is a reliable route for RNA interference technology. Bioinformatics-based siRNA design applies machine learning to existing experimental data sets to build a prediction model: the user inputs a target mRNA sequence, the model outputs candidate siRNA sequences with high inhibition rates, and only a few biological experiments are then needed for validation. Several siRNA prediction programs already exist, but most rely only on the sequence features of the siRNA itself, which limits prediction accuracy. Others use a more comprehensive feature set but skip feature selection, an important data preprocessing step, so building the prediction model is very time-consuming and accuracy also suffers. In practical machine learning tasks it is usually necessary to perform feature selection once the data has been obtained, and doing so also improves efficiency when the learner is trained later. Filter-style feature selection first selects features from the data set and then trains the learner, so the selection process is independent of the subsequent learner. To evaluate features, a filter algorithm assigns a weight score to every feature of the data, and it does so without building a model. Once the weights have been assigned, features whose weight is below a set threshold are removed, and features above the threshold are retained and then used for feature analysis, classification, or building feature relationship models.
For the 107 siRNA features in common use today, this thesis designs a concrete procedure for the Relief feature selection algorithm that suits the actual distribution of the experimental data set. The experiment selected 88 relevant features and removed 19 irrelevant ones. Training a random forest prediction model on the 88 relevant features raised the correlation coefficient under 10-fold cross-validation from 0.629 to 0.640, while also making the random forest model more efficient to build and reducing the time complexity of running the siRNA software. The thesis also finds a statistically clear positive correlation between the siRNA inhibition rate and the energy difference at the 5' ends of the siRNA duplex: the higher the energy difference, the higher the inhibition rate, and the lower the energy difference, the lower the inhibition rate. A statistical analysis of the Dieter Huesken data set then gave the following results: (1) position 1 from the 5' end of the antisense strand should be A or U, not G or C; (2) position 2 should be A or U, not G or C; (3) position 7 should not be C; (4) position 14 should not be G;
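The weight-then-threshold idea behind filter-style selection can be illustrated with the basic two-class Relief algorithm. The sketch below is not the procedure designed in the thesis, which the abstract does not spell out. It assumes the inhibition rate has been binarized into two classes (for example high vs. low), that each class has at least two samples, and it uses synthetic data. The names `relief_weights` and `select_features` and the threshold value are hypothetical.

```python
import numpy as np

def relief_weights(X, y, n_iter=None, seed=0):
    """Basic two-class Relief: one weight per feature, no model is trained."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    # Scale every feature to [0, 1] so per-feature differences are comparable.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0
    Xs = (X - lo) / span
    rng = np.random.default_rng(seed)
    m = n if n_iter is None else n_iter
    w = np.zeros(d)
    for i in rng.integers(0, n, size=m):
        diff = np.abs(Xs - Xs[i])      # per-feature difference to sample i
        dist = diff.sum(axis=1)        # Manhattan distance to sample i
        dist[i] = np.inf               # never pick the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        # A useful feature differs a lot on the miss and little on the hit.
        w += (diff[miss] - diff[hit]) / m
    return w

def select_features(weights, threshold):
    """Keep the indices of features whose weight exceeds the threshold."""
    return np.flatnonzero(weights > threshold)

# Synthetic demo (assumed data): feature 0 tracks the class, feature 1 is noise.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.05 * rng.standard_normal(100), rng.random(100)])
w = relief_weights(X, y)
kept = select_features(w, threshold=0.2)
print(w, kept)
```

In this toy setting the informative feature receives a much larger weight than the noise feature, so only the former survives the threshold. The retained columns, `X[:, kept]`, would then be passed to whichever learner is trained afterwards, such as a random forest.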
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q78;TP181
