Research and Implementation of Hadoop-Based Big Data Analysis and Mining Technology for the Communications Industry

Published: 2019-04-12 06:20

【Abstract】: As information technology advances, the volume of generated data is growing rapidly, and data mining techniques have developed alongside it. Data at this scale brings both challenges and opportunities, and extracting useful information from it is a difficult task. The communications industry holds large amounts of customer data. Applying big data techniques to analyze and mine this data for latent knowledge, and thereby improve the service experience, is a worthwhile goal. Against this background, this thesis does the following work.
First, the algorithms are studied and improved: a clustering algorithm is used for customer segmentation, and a decision tree algorithm is used for customer prediction. The traditional K-means algorithm requires the number of clusters as input, but with data at this scale the distribution is not known in advance, which makes the algorithm hard to apply. To address this shortcoming, the thesis improves the K-means clustering algorithm and implements a DGK-means algorithm, which uses a genetic algorithm to compute the most suitable number of clusters and a density-based idea to compute the genetic algorithm's fitness function, improving both efficiency and accuracy. The C4.5 decision tree algorithm is used to build a decision tree model, which then predicts outcomes for records whose result is unknown, serving the goals of customer prediction and customer retention.
Second, the Hadoop platform is used for big data analysis and mining, and a Hadoop-based big data analysis and mining system for the communications industry is designed and implemented. HDFS provides distributed storage of the data, and the MapReduce programming model provides parallel computation of the algorithms. At the algorithm layer, each algorithm is given a parallel design, which improves efficiency.
Finally, test data sets are used to verify the performance of the system and the algorithms. The results show that the DGK-means algorithm improves on the traditional algorithm in both accuracy and efficiency. Parallel computation improves efficiency when the cluster has more than 2 nodes, and the improvement becomes more pronounced as the number of nodes increases.
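The abstract does not give the details of the parallel design, so the following is only a minimal sketch of how one iteration of standard K-means is commonly decomposed into map and reduce steps. It is not the thesis's DGK-means algorithm and it runs in a single Python process rather than on Hadoop. The function names `map_point`, `reduce_cluster`, and `kmeans_iteration` are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: sum the points emitted for one centroid and average them."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """One K-means iteration expressed as map -> shuffle -> reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase (group by key)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]
```

In a real MapReduce job the map calls would run on separate nodes over HDFS blocks, and the driver would repeat the iteration until the centroids stop moving. Because each map call depends only on its own point and the current centroids, the work can be divided across nodes.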
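【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year degree awarded】: 2016
【Classification number】: TP311.13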

【References】