Research and Application of Medical Insurance Fraud Detection Based on the Hadoop Platform

Published: 2018-06-09 13:04

Topic: clustering + classification; Source: University of Electronic Science and Technology of China, master's thesis, 2017

[Abstract]: As China's medical and economic standards have continued to rise, medical insurance coverage has become very broad, and the public has enjoyed tangible benefits from medical insurance policy. At the same time, abuse of medical insurance funds is growing more serious: more and more money is being obtained fraudulently, making it imperative to crack down on fraud. At present, medical insurance agencies mainly audit settlement records with rule systems. These rules depend on a small number of indicators, and because they are incomplete and slow to be updated, relatively static rules are easily deceived by carefully forged data. Computer-assisted review is therefore urgently needed. This thesis analyzes the characteristics of medical insurance data, uses data mining techniques to build a fraud detection workflow, and integrates it with the business system to implement fraud detection and auditing for large-scale medical insurance data. The main contents are as follows:

1. Feature engineering on the raw data. For historical reasons, the existing dataset has many defects. The raw data is first processed with feature engineering, including removing noisy records, filling in missing values, and extracting features based on the actual business process.

2. Coarse-grained fraud screening based on DBSCAN. Because the data is extremely imbalanced, the thesis studies the application of unsupervised algorithms to fraud detection. It compares the results of several clustering algorithms on the dataset and, drawing on label information, settles on DBSCAN to identify anomalous clusters.

3. Precise fraud detection based on density sampling and random forest. Building on the anomalous groups identified by clustering, a density-based sampling method is proposed to rebalance the data, and the sampling information is used within the random forest algorithm to select and ensemble the sub-classifiers. Combining classification with clustering substantially improves accuracy and yields a complete fraud detection framework.

4. Parallel implementation on the Hadoop platform. For large-scale data, parallel versions of DBSCAN and random forest are proposed and implemented with Map-Reduce on Hadoop, resulting in a complete fraud detection and auditing system.

This thesis applies data mining to anomaly detection in medical insurance. Its contributions are as follows. It is no longer restricted to modeling specific fraud scenarios, so it can identify relatively rare cases and generalizes more broadly. Using local density as the link, it proposes a density-based sampling method that combines DBSCAN with random forest, maintaining high accuracy while effectively controlling overfitting. Alongside the parallel algorithms, it proposes a partitioning method for high-dimensional data that reflects the idea of load balancing.
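The abstract does not give the algorithmic details, so the following is only a minimal sketch of the general idea behind the coarse screening and rebalancing steps, not the thesis's actual method. It assumes a plain O(n^2) DBSCAN in which noise points are treated as suspicious records, and it assumes one plausible density-based sampling rule (normal records with fewer neighbors are drawn with higher probability). The two-feature toy data, the parameter values, and the names `dbscan` and `density_sample` are all hypothetical; the random forest stage and the Map-Reduce parallelization are not shown.

```python
import math
import random

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns (labels, neighbor_counts)."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
    return labels, [len(n) for n in neighbors]


def density_sample(indices, counts, k, seed=0):
    """Draw k indices without replacement, favoring low-density points.

    Assumed rule (not from the thesis): weight = 1 / neighbor count.
    """
    rng = random.Random(seed)
    pool = list(indices)
    chosen = []
    while pool and len(chosen) < k:
        weights = [1.0 / counts[i] for i in pool]
        pick = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(pick))
    return chosen


# Hypothetical toy data: 25 tightly grouped "normal" records on a grid,
# followed by two isolated records (indices 25 and 26).
points = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
points += [(3.0, 3.0), (-2.0, 4.0)]

labels, counts = dbscan(points, eps=0.15, min_pts=4)

# Coarse screening: records that DBSCAN leaves as noise are flagged.
suspicious = [i for i, lab in enumerate(labels) if lab == NOISE]

# Rebalancing: downsample the majority (clustered) records by density.
normal = [i for i, lab in enumerate(labels) if lab != NOISE]
sampled = density_sample(normal, counts, k=10)

print("suspicious:", suspicious)
print("sampled normal records:", sorted(sampled))
```

In a fuller pipeline, the flagged records and the downsampled normal records would form the rebalanced training set for a classifier such as a random forest. The quadratic neighbor search used here is only practical for small data, which is the kind of cost that motivates a partitioned, parallel implementation.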
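[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: F842.684; TP311.13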

[References]