Application of an Improved Random Forest Parallel Classification Method to Telecom Operator Big Data

Published: 2019-03-25 07:04

[Abstract]: Telecom operators provide network services to consumers and thereby accumulate rich data resources. To exploit the value of these data, this thesis designs and implements a customer classification system for second-hand real estate agencies based on operator big data. The system uses an improved random forest classification method, the MapReduce parallel computing framework, cluster analysis, and other big data processing techniques, combined with data analysis methods from mathematical statistics and complex networks and with web crawler technology. From each day's operator call records it extracts potential customers of real estate agencies and classifies them as renters, landlords, home buyers, home sellers, or other, so that agencies can carry out precision marketing. The classification algorithm is the core of the system. This thesis proposes an improved random forest classification algorithm with three improvements: (1) it is shown mathematically and experimentally that, for balanced data, increasing the sample size of sampling with replacement effectively improves accuracy; (2) the original sampling with replacement is replaced by an equivalent simple random sampling, which reduces the running time of the algorithm and improves system efficiency; (3) regression analysis gives a quantitative relationship between the degree of imbalance and the sample size of sampling with replacement, and the sample size suitable for this system is then determined from the degree of imbalance of the operator data. The system consists of a data collection subsystem, a data preprocessing subsystem, a data analysis subsystem, and a feedback adjustment subsystem. The data collection subsystem is mainly responsible for collecting real estate agent data. The data preprocessing subsystem uses parallel processing to filter out call records unrelated to real estate agents and to extract potential customers together with all of their call behavior information. The data analysis subsystem classifies potential customers with the improved random forest algorithm. In the cold-start stage, when the system has no training samples yet, it builds visual dimension plots with the statistical language R, builds a visual interaction network with the complex network analysis software Cytoscape, and applies cluster analysis to the initial sample set, which helps obtain training samples quickly and sort out combinations of feature dimensions. The feedback adjustment subsystem adds qualifying labeled samples obtained during later operation to the training sample library, continually adjusting the classification system and refining the classification boundaries so that subsequent classification becomes more accurate. When the improved random forest algorithm is applied in this system and the initial training samples are used as the test samples, the classification error rate is about 21.1379%, which is 0.3895 percentage points lower than the error rate of the unimproved algorithm (21.5274%). The classification system with the improved random forest algorithm reaches an accuracy of about 79% and helps improve the sales performance of real estate agencies.
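Improvement (2) in the abstract swaps the per-tree sampling with replacement for a simple random sample, but the abstract does not say how the two are made equivalent. The following is a minimal Python sketch of one possible reading, not the thesis's actual method: it assumes the simple random sample is drawn without replacement and sized to match the expected number of distinct samples in a draw with replacement. The function names and the matching rule `equivalent_subsample_size` are hypothetical and introduced only for illustration.

```python
import random


def expected_unique_fraction(n: int, k: int) -> float:
    """Expected fraction of the n samples that appear at least once
    when k draws are made with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** k


def bootstrap_indices(n: int, k: int, rng: random.Random) -> list[int]:
    """Sampling with replacement: k draws from range(n), duplicates allowed."""
    return [rng.randrange(n) for _ in range(k)]


def simple_random_indices(n: int, m: int, rng: random.Random) -> list[int]:
    """Simple random sampling without replacement: m distinct indices."""
    return rng.sample(range(n), m)


def equivalent_subsample_size(n: int, k: int) -> int:
    """Hypothetical matching rule (an assumption, not from the thesis):
    choose m equal to the expected number of distinct samples that a
    k-draw sample with replacement would contain."""
    return round(n * expected_unique_fraction(n, k))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    # Compare a standard-size draw (k = n) with a doubled one (k = 2n).
    for k in (n, 2 * n):
        m = equivalent_subsample_size(n, k)
        boot = bootstrap_indices(n, k, rng)
        sub = simple_random_indices(n, m, rng)
        print(f"k={k} m={m} distinct_with_replacement={len(set(boot))} "
              f"distinct_simple_random={len(set(sub))}")
```

Under this assumption, a draw of size k = n with replacement covers roughly 63% of the distinct training samples, and a larger k covers more, so each tree would see more distinct data as the sample size grows. The simple random sample yields the same number of distinct samples with fewer index draws and no duplicates to process.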
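[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP311.13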
