A Churn Prediction Model for Anonymized Telecom Customer Data

Published: 2018-03-21 19:14

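Topic: churn prediction　Focus: imbalanced binary classification　Source: University of Science and Technology of China, 2017 master's thesis　Type: degree thesis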

[Abstract]: Churn prediction is a core component of telecom customer relationship management. By building models with data mining techniques, operators can estimate the churn probability of at-risk customers, design targeted marketing strategies, and ground their decisions in data. A broad literature survey shows that telecom customer churn prediction is generally studied as a binary classification problem. The research currently faces three key scientific problems. First, the imbalanced distribution of positive and negative samples in the data set suppresses the classification performance of classical data mining algorithms. Second, the privacy protection policies applied to commercial big data make it harder for researchers to understand what the data actually mean. Third, the total number of features that traditional feature engineering can construct has an upper limit, which creates a bottleneck for model optimization. To overcome the imbalance between positive and negative samples, this thesis combines sampling techniques with ensemble learning theory and proposes an imbalanced ensemble classifier. The model uses sampling with replacement to construct data subsets in which positive and negative samples are approximately balanced, trains a logistic regression classifier on each subset, accumulates the predictions of all classifiers through a voting mechanism, and takes the average of those predictions as the final output of the ensemble. To overcome the data-understanding problem caused by anonymized features, the thesis combines data discretization with one-hot encoding and proposes a method for constructing high-dimensional features based on deep learning. Through a hierarchical network structure, the method extracts a large number of redundant features, compensating for the difficulty of applying domain knowledge and expert experience to encrypted data. In addition, drawing on the strengths of decision tree models in imbalanced classification, the thesis applies gradient boosted tree models to telecom customer churn prediction and further proposes a method for extracting low-dimensional features based on boosted trees. This method combines ensemble learning theory with statistical theory, improving prediction performance while reducing computational complexity. Experiments show that the proposed algorithms effectively improve the predictive performance of the model. However, the limited sample size of the data set constrains the performance of some algorithms, so room remains for further research.
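The imbalanced ensemble classifier described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the hyperparameters (`lr`, `epochs`, `n_models`), the plain gradient-descent logistic regression, and the synthetic two-feature data with roughly 5% churners are all assumptions made for illustration. Only the overall recipe follows the abstract: draw approximately balanced subsets by sampling with replacement, train one logistic regression per subset, and average the predicted probabilities.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_one(w, x):
    # w[0] is the bias, w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, epochs=200):
    # Batch gradient descent on the mean log-loss (assumed training scheme).
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_one(w, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def balanced_bootstrap(X, y, rng):
    # Sample with replacement: as many negatives as there are positives.
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    k = len(pos)
    idx = [rng.choice(pos) for _ in range(k)] + [rng.choice(neg) for _ in range(k)]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_ensemble(X, y, n_models=10, seed=0):
    # One logistic regression per balanced bootstrap subset.
    rng = random.Random(seed)
    return [fit_logistic(*balanced_bootstrap(X, y, rng)) for _ in range(n_models)]

def predict_proba(models, x):
    # Final output: average of the individual classifiers' probabilities.
    return sum(predict_one(w, x) for w in models) / len(models)

# Synthetic toy data (hypothetical): 190 non-churners, 10 churners.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(190)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 190 + [1] * 10

models = fit_ensemble(X, y, n_models=10, seed=0)
p_churn_like = predict_proba(models, [3.0, 3.0])
p_loyal_like = predict_proba(models, [-1.0, -1.0])
```

Because every base classifier sees the minority class at a roughly 1:1 ratio, none of them can reach a low loss by always predicting the majority class, and averaging over several subsets reduces the variance that comes from each subset being small.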
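[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: F274;F626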
