A Churn Prediction Model for Anonymized Telecom Customer Data
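Topic: churn prediction  Focus: imbalanced binary classification  Source: University of Science and Technology of China, 2017 master's thesis  Document type: degree thesis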
[Abstract]: Churn prediction is a core part of telecom customer relationship management. Models built with data mining techniques estimate the churn probability of at-risk customers, help operators design targeted marketing strategies, and provide data support for decision-making. An extensive literature survey shows that telecom customer churn prediction is generally studied as a binary classification problem. The key scientific problems currently facing this research are as follows. First, the imbalanced distribution of positive and negative samples in the data set suppresses the classification performance of classical data mining algorithms. Second, privacy protection policies for commercial big data make it harder for researchers to understand the real meaning of the data. Third, the total number of features that traditional feature engineering can construct has an upper limit, which creates a bottleneck for model optimization. To overcome the imbalanced distribution of positive and negative samples, this thesis combines sampling techniques with ensemble learning theory and proposes an imbalanced ensemble classifier. The model uses sampling with replacement to construct data subsets in which positive and negative samples are approximately balanced, trains a logistic regression classifier on each subset, accumulates the predictions of all classifiers through a voting mechanism, and takes the average of those predictions as the final output of the ensemble. To overcome the data understanding problem caused by anonymous features, this thesis combines data discretization with one-hot encoding and proposes a method for constructing high-dimensional features based on deep learning. Through a hierarchical network structure, the method extracts a large number of redundant features, compensating for the difficulty of applying domain knowledge and expert experience to encrypted data. In addition, drawing on the strengths of decision tree models in imbalanced classification, this thesis applies the gradient boosting tree model to telecom customer churn prediction and further proposes a method for extracting low-dimensional features based on boosted trees. This method combines ensemble learning theory with statistical theory, improving prediction performance while reducing computational complexity. Experiments show that the proposed algorithms effectively improve the predictive performance of the model. However, the limited sample size of the data set constrains the performance of some of the algorithms, so room remains for further research.
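As a rough illustration of the ensemble for imbalanced data described in the abstract, the sketch below draws approximately balanced subsets by sampling with replacement, fits one logistic regression per subset, and averages the predicted probabilities. It is a minimal sketch and not the thesis implementation: the function names, the per-class subset size, the learning rate, the number of models, and the synthetic two-feature data are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_balanced_ensemble(X, y, n_models=10, seed=0):
    """Train one logistic regression per approximately balanced subset."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    k = min(len(pos), len(neg))  # per-class subset size (an assumption)
    models = []
    for _ in range(n_models):
        # Sampling with replacement within each class yields a balanced subset.
        idx = rng.choices(pos, k=k) + rng.choices(neg, k=k)
        models.append(fit_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return models


def ensemble_proba(models, x):
    # Final output: mean of the base classifiers' predicted probabilities.
    return sum(predict_proba(m, x) for m in models) / len(models)


def make_toy_data(n_neg=200, n_pos=20, seed=0):
    """Synthetic imbalanced data (hypothetical): churners (label 1) are the minority."""
    rng = random.Random(seed)
    X = [[rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)] for _ in range(n_neg)]
    X += [[rng.gauss(1, 0.5), rng.gauss(1, 0.5)] for _ in range(n_pos)]
    y = [0] * n_neg + [1] * n_pos
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    models = fit_balanced_ensemble(X, y)
    print(ensemble_proba(models, [1.5, 1.5]), ensemble_proba(models, [-1.5, -1.5]))
```

Averaging probabilities rather than counting hard votes keeps the output usable as a churn score for ranking customers. Whether the thesis resamples the majority class, the minority class, or both is not stated in the abstract, so the per-class sampling here is only one plausible reading.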
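[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: F274; F626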
Article ID: 1645173
Article link: https://www.wllwen.com/guanlilunwen/yingxiaoguanlilunwen/1645173.html