Research on Associative Classification Algorithms and Their Application to Mining Massive Chronic Disease Medical Data

Published: 2018-06-17 09:11

Topic: associative classification + Hadoop; Source: master's thesis, Beijing University of Posts and Telecommunications, 2016

[Abstract]: Associative classification combines association rule mining with classification: it first mines class association rules and then builds a classifier from those rules. Compared with traditional classification algorithms such as decision trees and neural networks, it offers high classification accuracy and highly interpretable models, which makes it especially suitable for settings such as medical data mining, where the classification model must be easy to understand and apply. Chronic diseases such as hypertension and cardiovascular and cerebrovascular disease do great harm to human health, so it is necessary to use data mining to build classification decision models for chronic diseases, for disease prediction and diagnostic support. However, chronic disease data contain many numerical attributes whose importance varies widely, and these characteristics make existing associative classification techniques perform poorly. Targeting these characteristics, this thesis proposes a fuzzy weighted associative classification algorithm based on information gain ratio to improve classification accuracy. It also parallelizes and optimizes the single-node associative classification algorithm to improve its scalability and thereby meet the demand for efficient processing of massive data. The work centers on the design of the fuzzy weighted associative classification algorithm, the design of a chronic disease data mining scheme, and the parallelization and performance evaluation of the algorithm.
First, fuzzy sets and information gain ratio are combined to propose the GRWFAC algorithm, which improves classifier performance. Next, for the scenario of cardiovascular disease risk prediction, a massive chronic disease data mining scheme and the model's input and output parameters are designed. Finally, the parallel associative classification algorithm MRWFAC is redesigned and implemented on the Hadoop distributed platform, and experiments on massive chronic disease data are carried out to verify the performance improvement. The thesis verifies the feasibility of the chronic disease data mining scheme and the improvement in algorithm performance. Compared with the C4.5 and CBA algorithms, GRWFAC achieves higher accuracy and stability, and the parallel MRWFAC algorithm shows in speedup and scalability evaluations that it adapts to massive chronic disease data. The results are of positive significance for chronic disease prevention and control and for diagnostic support.
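The abstract does not give the formulas behind GRWFAC, so the following is only a minimal sketch of the standard information gain ratio (as used in C4.5), applied here as a per-attribute weight. The function names, the normalization of weights to sum to 1, and the toy attributes `blood_pressure` and `smoker` are illustrative assumptions and are not taken from the thesis; fuzzy partitioning of numerical attributes is not shown.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain ratio of one discrete attribute with respect to the class labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


def attribute_weights(columns, labels):
    """Assumed weighting scheme: gain ratios normalized to sum to 1."""
    ratios = {name: gain_ratio(vals, labels) for name, vals in columns.items()}
    total = sum(ratios.values())
    if total == 0:
        # No attribute is informative; fall back to uniform weights.
        return {name: 1 / len(ratios) for name in ratios}
    return {name: r / total for name, r in ratios.items()}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    columns = {
        "blood_pressure": ["high", "high", "low", "low"],
        "smoker": ["yes", "no", "yes", "no"],
    }
    labels = ["sick", "sick", "healthy", "healthy"]
    print(attribute_weights(columns, labels))
```

In this toy data `blood_pressure` separates the classes perfectly while `smoker` carries no information, so the first attribute receives the entire weight. How such weights enter rule support and confidence in GRWFAC is not stated in the abstract.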
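[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: R-05; TP311.13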

[References]