Application of Neural Network Models in Predicting Acute Myocardial Infarction and a Comparative Study of Model Predictive Ability

Published: 2018-05-11 11:01

Topics: cardiovascular disease + acute myocardial infarction; Source: Peking Union Medical College, doctoral dissertation, 2013

[Abstract]: Objective
Cardiovascular disease seriously endangers human health worldwide, and recent studies show that its incidence and mortality are rising in developing countries. Many studies have examined the risk factors for myocardial infarction and predicted the probability of onset. Predicting that probability requires a statistical model, but the statistical models in routine use have limited predictive ability. We sought a mathematical model better able to capture complex nonlinear relationships among variables, so as to provide a reference for the diagnosis and prevention of acute myocardial infarction in the Chinese population. A neural network model is a computing system developed by imitating the neural tissue of the human brain: a network of many processing units joined by extensive interconnections. It has the basic characteristics of a biological nervous system, including nonlinear mapping, learning, self-adaptation, fault tolerance, and associative memory, and it is an important class of models in data mining.
The purpose of this study was to construct a Logistic regression model, a BP (back-propagation) neural network model, and an Elman neural network model, and to combine conventional statistical methods with neural network models in predicting acute myocardial infarction, with the aim of improving prediction of the disease.
Methods
We divided the variables in the epidemiological survey data on acute myocardial infarction in the Chinese population into conventional variables and gene SNP locus variables. The conventional variables were classified as qualitative or quantitative, then described and subjected to univariate analysis. For the SNP locus variables, we calculated allele and genotype frequencies, tested for Hardy-Weinberg equilibrium, performed trend tests, and constructed haplotype blocks for the SNP loci.
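The allele-frequency calculation and Hardy-Weinberg check mentioned above can be illustrated with a minimal sketch. It assumes a single biallelic SNP and a one-degree-of-freedom chi-square goodness-of-fit test; the abstract does not say which test variant the thesis used, and the function and parameter names here are hypothetical.

```python
import math


def hardy_weinberg_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb are observed genotype counts at one biallelic SNP
    (hypothetical names: A/A homozygotes, A/B heterozygotes, B/B homozygotes).
    Returns (freq_a, chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")

    # Allele frequency of A: each A/A carries two copies, each A/B one.
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    freq_b = 1.0 - freq_a

    # Expected genotype counts under equilibrium: p^2, 2pq, q^2.
    expected = (n * freq_a ** 2, 2 * n * freq_a * freq_b, n * freq_b ** 2)
    observed = (n_aa, n_ab, n_bb)

    # Skip classes with zero expected count (a monomorphic locus).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # One degree of freedom: 3 genotype classes - 1 - 1 estimated frequency.
    # For df = 1 the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return freq_a, chi2, p_value
```

A small p-value suggests the genotype counts depart from equilibrium. Such loci are commonly reviewed for genotyping error before modelling, although the abstract does not describe how the thesis handled them.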
We then built three statistical prediction models: a conventional Logistic regression model, a BP neural network model, and an Elman neural network model. First, we applied each fitted model back to the data used to fit it and calculated the area under the ROC curve, giving a preliminary comparison of predictive accuracy. Next, we used random sampling to split the data into a training set and a validation set, rebuilt the models to evaluate their generalization ability, and compared predictive accuracy across repeated samplings. Finally, we ran a simulation study on randomly generated data. Because continuous and discrete variables behave differently in the models, the simulation covered two cases: in the first, the continuous variables were statistically significant; in the second, the discrete variables were statistically significant. Models were built for each case to study how well each model adapts to the variable types and how stable it is.
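The evaluation procedure described above (random training/validation splits, repeated sampling, area under the ROC curve) might be sketched as follows. This is only an illustration under assumptions: the AUC uses the rank (Mann-Whitney) formulation, and `fit` is a hypothetical callback standing in for whichever of the three models is being evaluated. The abstract does not give the thesis's software or the number of repetitions.

```python
import random


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen case (label 1) scores
    higher than a randomly chosen control (label 0); ties count one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def repeated_holdout_auc(rows, labels, fit, valid_fraction, repeats, seed=0):
    """Mean validation AUC over repeated random training/validation splits.

    `fit(train_rows, train_labels)` is a hypothetical callback returning a
    scoring function (row -> risk score).
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_valid = max(1, round(valid_fraction * len(rows)))
    aucs = []
    for _ in range(repeats):
        rng.shuffle(indices)
        valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
        score = fit([rows[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        valid_labels = [labels[i] for i in valid_idx]
        if len(set(valid_labels)) < 2:
            continue  # AUC is undefined without both classes; skip the split
        valid_scores = [score(rows[i]) for i in valid_idx]
        aucs.append(roc_auc(valid_labels, valid_scores))
    if not aucs:
        raise ValueError("no split contained both classes")
    return sum(aucs) / len(aucs)
```

Calling `repeated_holdout_auc` once per model and per validation fraction gives the kind of model-by-proportion comparison reported in the results. Testing whether two AUCs differ significantly needs a separate procedure that is not shown here.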
Results
The data were randomly split into a prediction (training) data set and a validation data set, and models were fitted to compare the predictive ability of the three models. Across the four validation-set proportions ranging from 10% to 40%, the area under the ROC curve of the BP neural network model was 4.5%, 3.1%, 3.3%, and 2.9% higher than that of the Logistic regression model, and the differences were statistically significant. The area under the ROC curve of the Elman neural network model was 4.2%, 2.1%, 2.9%, and 1.4% higher than that of the Logistic regression model; the differences were not statistically significant when 20% or 40% of the subjects formed the validation set. The differences in area under the ROC curve between the BP and Elman models at the four proportions were 0.2%, 0.9%, 0.4%, and 1.6%, none of them statistically significant. Compared with the conventional Logistic regression model, the BP neural network model significantly improved generalization ability.
In the simulation study, when the continuous variables were statistically significant (the first case), all three models predicted well. When the discrete variables were statistically significant (the second case), across the four validation-set proportions ranging from 10% to 40%, the areas under the ROC curve of the BP and Elman neural network models were 3.2%, 2.9%, 3.2%, and 3.1% higher than that of the Logistic regression model, and the differences were statistically significant. Both neural network models predicted significantly better than the Logistic regression model, and the difference between the Elman and BP models was not statistically significant.
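The two simulation cases could be generated along the following lines. This is purely an illustrative sketch: the abstract does not give the distributions, effect sizes, or sample sizes used in the thesis, so every numeric choice below, and the function name `simulate_case`, is an assumption.

```python
import math
import random


def simulate_case(n, case, seed=0):
    """Generate one simulated data set of n subjects.

    case="continuous": only the continuous predictor drives the outcome.
    case="discrete":   only the three-level discrete predictor drives it.
    Distributions and effect sizes are illustrative assumptions.
    Returns (rows, labels) with rows as (continuous, discrete) tuples.
    """
    if case not in ("continuous", "discrete"):
        raise ValueError("case must be 'continuous' or 'discrete'")
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)    # continuous predictor
        g = rng.choice((0, 1, 2))  # discrete predictor, e.g. a genotype code
        if case == "continuous":
            logit = 1.0 * x
        else:
            # Non-monotone effect across the three levels (assumed).
            logit = {0: -1.0, 1: 1.0, 2: -1.0}[g]
        prob = 1.0 / (1.0 + math.exp(-logit))
        labels.append(1 if rng.random() < prob else 0)
        rows.append((x, g))
    return rows, labels
```

The non-monotone discrete effect is chosen because a single linear term in the level code cannot represent it, which is one setting where a more flexible model might be expected to do better. Whether the thesis simulated effects of this shape is not stated.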
Conclusion
The applied results of this study show that the BP and Elman neural network models offer good predictive ability, fast computation, and good stability, and can handle complex nonlinear relationships. In particular, for data with a modest sample size, many discrete variables, and complex nonlinear relationships, the neural network models predicted better than Logistic regression analysis, which demonstrates the advantages and soundness of the neural network approach. These two neural network methods should have good practical value for prediction and evaluation in the epidemiology of heart disease.

[Degree-granting institution]: Peking Union Medical College
[Degree level]: Doctoral
[Year degree awarded]: 2013
[Classification number]: R542.22
