Design and Implementation of a Spark-Based Online Fraud Detection Algorithm

Published: 2018-05-26 02:55

Topic: fraud detection + imbalanced learning; Source: Zhejiang University, 2017 master's thesis

[Abstract]: In the era of big data, online businesses such as e-commerce and third-party payment have grown explosively, and online fraud has grown increasingly rampant along with them. As the cornerstone of an enterprise's risk-control capability, online fraud detection models business behavior to identify fraud cases more accurately and efficiently, and plays a major role in recovering losses and avoiding risk for users and online platforms. Because fraud cases are extremely rare relative to normal transactions, online fraud detection must above all address the imbalanced learning problem. In addition, as online business volume keeps growing, the online fraud detection system, a core component of the business system, faces increasingly strict performance requirements; combining big data technology with online fraud detection can greatly strengthen an enterprise's risk-control defenses. This thesis begins with an introduction to the relevant technologies, discussing in detail big data technologies including the distributed computing framework Spark and the real-time stream computing component Spark Streaming, and reviews progress in online fraud detection research. Against this big data background, the thesis proposes a clustering-based self-balancing dataset construction algorithm and a distributed capital-loss-sensitive Lasso algorithm, implements the two together on the Spark distributed computing framework, and evaluates them on relevant metrics using a real online fraud detection dataset. The main contributions are: (1) a clustering-based self-balancing incremental dataset construction algorithm, which uses an incremental clustering algorithm to measure the similarity of samples within each class and selects multiple representative sample points from each class to form the training set, effectively addressing within-class and between-class imbalance in online fraud detection datasets while preserving the information in the time-series data; (2) a distributed capital-loss-sensitive Lasso algorithm, designed for the online payment fraud detection scenario, which trains models efficiently at big data scale and effectively improves the capital-loss rate metric of the online fraud detection model; (3) a seamless integration of both algorithms on the Spark distributed computing framework and the Spark Streaming real-time stream processing component, which verifies the effectiveness of the above methods for online fraud detection at big data scale.
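The abstract describes the first contribution only in outline: cluster the samples incrementally, then keep a few representative points per cluster. The following is a minimal sketch of that general idea, not the thesis's algorithm. It assumes single-pass "leader" clustering with a fixed distance threshold, an equal quota of representatives per cluster, and plain Python rather than Spark; the names `leader_cluster`, `balanced_majority`, and `radius` are hypothetical.

```python
import math


def leader_cluster(points, radius):
    """Single-pass clustering (an assumed stand-in for the thesis's method).

    Each point joins the first cluster whose centroid lies within `radius`;
    otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for p in points:
        for c in clusters:
            if math.dist(p, c["centroid"]) <= radius:
                c["members"].append(p)
                n = len(c["members"])
                # Running-mean update: earlier points are never revisited.
                c["centroid"] = [m + (x - m) / n
                                 for m, x in zip(c["centroid"], p)]
                break
        else:
            clusters.append({"centroid": list(p), "members": [p]})
    return clusters


def balanced_majority(majority, minority, radius):
    """Undersample the majority class to roughly the minority-class size.

    Every cluster gets the same quota (an assumption of this sketch), so
    small sub-groups of the majority class are not drowned out by large ones.
    """
    clusters = leader_cluster(majority, radius)
    if not clusters:
        return []
    quota = max(1, len(minority) // len(clusters))
    representatives = []
    for c in clusters:
        # Members closest to the centroid are treated as most representative.
        ranked = sorted(c["members"],
                        key=lambda p: math.dist(p, c["centroid"]))
        representatives.extend(ranked[:quota])
    return representatives


# Toy data: two groups of normal transactions and one small fraud group.
normal = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0)]
fraud = [(9.0, 9.0), (9.1, 9.0)]
reps = balanced_majority(normal, fraud, radius=1.0)
print(len(reps), "normal samples kept for", len(fraud), "fraud samples")
```

Because the clustering pass touches each sample once and keeps only centroids and members, the same structure could in principle be updated as new micro-batches arrive, which is one plausible reading of "incremental" in the abstract.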
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13
