Research and Application of Rough Set Theory for Processing Massive Electronic Medical Records

Published: 2018-05-31 21:33

Topic: knowledge mining + rough set theory; Reference: Zhejiang Sci-Tech University, master's thesis, 2017

[Abstract]: With the rise of smart healthcare, large volumes of medical data have been integrated. Medical big data is a valuable asset, and mining knowledge from it has become a focus of current academic research. As the data volume and the number of redundant attributes grow, knowledge mining becomes more difficult. This thesis studies how to reduce the dimensionality of massive medical data effectively and thereby improve the efficiency of knowledge mining. Rough set theory is well suited to handling incomplete data and to representing, generalizing, and learning from imprecise knowledge, and attribute reduction is one of its main applications. The thesis summarizes the problems of commonly used rough set attribute reduction algorithms, proposes an optimization strategy that combines rough set attribute reduction with the tabu search algorithm, together with a parallelization scheme, and verifies the algorithm's performance through simulation experiments and disease classification experiments. The work offers an approach to improving reduction algorithms and makes efficient processing of large data sets possible. The specific research contents are as follows: (1) Based on a review of the relevant domestic and international literature, common rough set attribute reduction algorithms are analyzed, the problems of each algorithm are summarized, and the main content of the thesis is determined. (2) Drawing on the characteristics of rough set theory and the tabu search algorithm, a tabu search attribute reduction algorithm is proposed. The components of the algorithm are described first, including the solution representation, the solution quality measure, the tabu list, the generation of neighboring candidate solutions, and the diversification and intensification modes; the complete implementation flow of the algorithm is then presented. To improve the scalability of the tabu search attribute reduction algorithm, a parallelization scheme for it is also proposed. (3) To test the basic performance of the tabu search attribute reduction algorithm, UCI data sets are used as experimental data. Simulation experiments are run with the proposed algorithm and several common attribute reduction algorithms, and the results are used to compare the algorithms in terms of feasibility, stability, and reduction quality. (4) To test the effectiveness of the tabu search attribute reduction algorithm, a Hadoop experimental environment is built and massive electronic medical records are used as experimental data. In the data preprocessing stage, four traditional attribute reduction algorithms and the proposed tabu-search-based attribute reduction algorithm are used to reduce the attributes; in the classification stage, the naive Bayes classification algorithm is used to build five disease classifiers. The disease classification experiment demonstrates the effectiveness of the tabu-search-based attribute reduction algorithm.
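The abstract names the components of the tabu search attribute reduction algorithm but does not give its details. The following Python code is only a minimal illustrative sketch of the general idea, not the thesis's implementation. It assumes a solution is a 0/1 mask over the condition attributes, that solution quality is measured by the rough set dependency degree (the share of records in the positive region), and that a neighbor is produced by flipping one attribute. The names `partition`, `dependency`, `tabu_reduct`, `n_iter`, and `tenure` are hypothetical, and the diversification and intensification modes and the parallelization scheme are omitted.

```python
def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())


def dependency(rows, decisions, attrs):
    """Dependency degree: fraction of rows whose equivalence class
    (with respect to attrs) carries a single decision value."""
    attrs = list(attrs)
    positive = 0
    for block in partition(rows, attrs):
        if len({decisions[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def tabu_reduct(rows, decisions, n_iter=50, tenure=3):
    """Search for a small attribute subset that keeps the full dependency degree.

    Assumed design (not taken from the thesis): one-flip neighborhood,
    fixed tabu tenure, and a simple aspiration rule.
    """
    n = len(rows[0])
    full = dependency(rows, decisions, range(n))

    def cost(mask):
        attrs = [a for a in range(n) if mask[a]]
        # Compared lexicographically: lost dependency first, subset size second.
        return (full - dependency(rows, decisions, attrs), len(attrs))

    current = [1] * n                # start from the full attribute set
    best, best_cost = current[:], cost(current)
    tabu = {}                        # attribute -> last iteration it stays tabu

    for it in range(n_iter):
        candidates = []
        for a in range(n):
            neighbor = current[:]
            neighbor[a] ^= 1         # flip one attribute in or out
            c = cost(neighbor)
            # Skip tabu moves unless they beat the best solution (aspiration).
            if tabu.get(a, -1) >= it and c >= best_cost:
                continue
            candidates.append((c, a, neighbor))
        if not candidates:
            continue                 # every move is tabu; wait for tenure to expire
        c, a, current = min(candidates)
        tabu[a] = it + tenure
        if c < best_cost:
            best, best_cost = current[:], c

    return [a for a in range(n) if best[a]]


# Toy decision table: attributes 0 and 2 each determine the decision,
# attribute 1 is noise.
toy_rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
toy_decisions = [0, 0, 1, 1]
print(tabu_reduct(toy_rows, toy_decisions))
```

Accepting the best non-tabu neighbor even when it is worse than the current solution is what lets the search leave a local optimum, while the tabu list keeps it from immediately undoing the move. On real data the dependency computation dominates the running time, which is presumably where a parallel scheme such as the one the thesis proposes would help.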
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: R-05;TP18
