Research and Application of a Cloud-Computing-Based Bayesian Algorithm in Disease Prediction

Published: 2018-03-29 11:38

Topic: disease prediction　Focus: Bayesian classification　Source: University of Science and Technology of China, master's thesis, 2016

[Abstract]: Disease diagnosis is an important problem in medicine. Medical institutions have accumulated ever more patient-visit sample data, yet manual disease classification and prediction on these samples is subject to experience, decision-making ability, and other subjective factors, so errors are hard to avoid and both accuracy and efficiency leave considerable room for improvement. The theory of disease prediction in traditional Chinese medicine (TCM) emphasizes that health is closely tied to the internal and external environment. The class-attribute joint probability of a Bayesian classifier, which is grounded in probability and statistics, is hard to estimate accurately, and classification algorithms that rely on single-machine memory cannot process large-scale sample sets within the expected time. An ideal classification model fully expresses the association between sample features and disease classes while improving classification performance and scalability. To address these shortcomings, this thesis makes the following improvements.
First, from the perspective of local learning, it proposes a naive Bayes classification algorithm improved by instance weighting based on cosine similarity (IWIMNB). The algorithm builds a high-quality classifier on a local region of the training sample set and uses the local training samples to weaken the attribute conditional-independence assumption. It uses cosine similarity to measure the distance between the validation sample and the training samples, and uses that similarity as the weight when training the parameters of the modified naive Bayes model. Comparative experiments show that IWIMNB is practical to apply and achieves better classification performance.
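A minimal sketch of the instance-weighting idea described above, not the thesis's IWIMNB implementation: each training sample is weighted by its cosine similarity to the query sample, and those weights replace raw counts when the naive Bayes parameters are estimated. The multinomial likelihood, the Laplace smoothing, the use of all training samples instead of a selected local neighbourhood, and the names `cosine_similarity` and `predict_instance_weighted_nb` are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def predict_instance_weighted_nb(train_x, train_y, query):
    """Predict the class of `query` with similarity-weighted naive Bayes.

    Assumes non-negative feature vectors (for example counts), so that the
    multinomial likelihood and the cosine weights are both well defined.
    """
    classes = sorted(set(train_y))
    n_features = len(query)

    # One weight per training sample: its similarity to the query.
    weights = [cosine_similarity(row, query) for row in train_x]

    # Weighted sufficient statistics per class.
    class_weight = {c: 0.0 for c in classes}
    feature_weight = {c: [0.0] * n_features for c in classes}
    for row, label, w in zip(train_x, train_y, weights):
        class_weight[label] += w
        for j, value in enumerate(row):
            feature_weight[label][j] += w * value

    total_weight = sum(class_weight.values())
    best_label, best_score = None, -math.inf
    for c in classes:
        # Laplace-smoothed weighted prior.
        score = math.log((class_weight[c] + 1.0) / (total_weight + len(classes)))
        # Laplace-smoothed weighted multinomial likelihood.
        denom = sum(feature_weight[c]) + n_features
        for j, value in enumerate(query):
            score += value * math.log((feature_weight[c][j] + 1.0) / denom)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Toy data (hypothetical): two classes with different dominant features.
train_x = [[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 2]]
train_y = ["a", "a", "b", "b"]
print(predict_instance_weighted_nb(train_x, train_y, [2, 0, 1]))
```

Because the weights depend on the query, the parameters are re-estimated for every prediction. That is the cost of a local model, and it is one reason a distributed implementation becomes attractive on large sample sets.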
Second, from the perspective of structure extension, it applies association rules to the weighted average one-dependence Bayesian model (AR-WAODE), so that the model accounts both for dependencies among attributes that are not the common parent node and for the differing contributions of the individual AODEs to classification. To generate association rules more efficiently, it proposes a distributed frequent itemset mining algorithm based on matrix pruning (DFIMA), which aims to reduce the useless candidate itemsets produced by the Apriori algorithm and the file-system I/O load. DFIMA uses a 2-candidate-itemset matrix to prune the computation that generates (k+1)-frequent itemsets, and the improved algorithm is then implemented on Spark, an in-memory iterative computing framework. Comparative experiments show that DFIMA reduces the useless candidate itemsets generated during iteration and performs well in speedup and scalability.
Then, the AR-WAODE classification algorithm is implemented on the Hadoop framework (Hadoop-AR-WAODE) to speed up the training of model parameters. The algorithm consists of a preprocessing job, a classifier training job, and a prediction job. Comparative experiments show that Hadoop-AR-WAODE improves the predictive performance of the classification model by taking into account the dependencies among non-common-parent attributes and the differing contributions of the individual AODEs to the classification result, and that classification efficiency improves effectively on large-scale sample sets.
Finally, the Hadoop-AR-WAODE algorithm is applied to a real disease classification and prediction problem: guided by the conclusions of a preliminary data analysis of the raw sample set, a disease classification model is designed and implemented.
The model takes meridian values, measurements of facial, tongue, and pulse signs, and meteorological data as input, and outputs the disease class. Comparative experiments show that, because disease prediction theory is still immature, the classification performance of the disease classification model is limited; the model nevertheless offers good processing efficiency and scalability and has some reference value for the field of disease prediction.
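The matrix-pruning step described for DFIMA can be illustrated on a single machine. The sketch below is an illustration under stated assumptions, not the thesis's algorithm: it stores the frequent 2-itemsets in a lookup set (standing in for the 2-itemset matrix, whose exact construction the abstract does not give) and skips any (k+1)-candidate whose two joined items do not form a frequent pair, before the more expensive support count. It leaves out the Spark distribution entirely, and the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]

    def support(items):
        candidate = frozenset(items)
        return sum(1 for basket in baskets if candidate <= basket)

    result = {}

    # Level 1: frequent single items, kept in sorted order.
    frequent_items = []
    for item in sorted({i for basket in baskets for i in basket}):
        count = support([item])
        if count >= min_support:
            result[frozenset([item])] = count
            frequent_items.append(item)

    # Level 2: frequent pairs. This set plays the role of the pruning matrix
    # (an assumption; the thesis's matrix layout is not specified here).
    frequent_pairs = set()
    for pair in combinations(frequent_items, 2):
        count = support(pair)
        if count >= min_support:
            result[frozenset(pair)] = count
            frequent_pairs.add(pair)

    # Level k -> k+1: join sorted itemsets that share their first k-1 items.
    level = sorted(frequent_pairs)
    while level:
        next_level = []
        for i, left in enumerate(level):
            for right in level[i + 1:]:
                if left[:-1] != right[:-1]:
                    break  # sorted order: no later itemset shares the prefix
                # Pruning: the two differing items must be a frequent pair,
                # otherwise the candidate cannot be frequent.
                if (left[-1], right[-1]) not in frequent_pairs:
                    continue
                candidate = left + (right[-1],)
                count = support(candidate)
                if count >= min_support:
                    result[frozenset(candidate)] = count
                    next_level.append(candidate)
        level = sorted(next_level)
    return result


# Toy transactions (hypothetical).
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, count in frequent_itemsets(baskets, 2).items():
    print(sorted(itemset), count)
```

The pair lookup is a cheap filter that discards candidates without scanning the transactions, which is the kind of saving in useless candidates and I/O that the abstract attributes to DFIMA.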
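[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: O212.8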
