Research and Implementation of Hadoop-Based Big Data Analysis and Mining Technology for the Telecommunications Industry
Published: 2019-04-12 06:20
[Abstract]: As information technology advances, the volume of data being generated is growing rapidly, and data mining techniques have developed in response. Data at this scale brings both challenges and opportunities, and extracting useful information from it is a demanding task. The telecommunications industry holds large amounts of customer data; applying big data techniques to analyze and mine this data, uncover latent knowledge, and thereby improve the service experience is a worthwhile goal. Against this background, this thesis makes the following contributions. First, it studies and improves the algorithms involved: a clustering algorithm is used for customer segmentation, and a decision tree algorithm is used for customer prediction. The traditional K-means algorithm requires the number of clusters as input, but with data at this scale the distribution is not known in advance, which makes the algorithm hard to apply. To address this shortcoming, the thesis improves K-means and implements a DGK-means algorithm, which uses a genetic algorithm to compute the most suitable number of clusters and a density-based approach to compute the genetic algorithm's fitness function, improving the efficiency and accuracy of the algorithm. The C4.5 algorithm is used to build a decision tree model, which is then applied to records whose outcome is unknown in order to support customer prediction and customer retention. Second, the thesis uses the Hadoop platform for big data analysis and mining: it designs and implements a Hadoop-based big data analysis and mining system for the telecommunications industry, using HDFS for distributed storage and the MapReduce programming model to parallelize the algorithms. Each algorithm in the algorithm layer is given a parallel design to improve efficiency.
Finally, test datasets are used to validate the performance of the system and the algorithms. The results show that DGK-means improves on the traditional algorithm in both accuracy and efficiency, and that parallel computation improves efficiency when the cluster has more than two nodes, with the gain becoming more pronounced as the number of nodes increases.
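The abstract says the algorithms were parallelized with the MapReduce programming model but gives no design details. As a rough illustration only, the sketch below shows how one iteration of plain K-means (not the thesis's DGK-means) is commonly split into a map step and a reduce step. It runs in memory in ordinary Python rather than on Hadoop, and the function names `map_assign`, `reduce_mean`, and `kmeans_iteration` are hypothetical, chosen for this example.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One K-means iteration expressed as map, shuffle, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)  # shuffle: group mapped values by key
    # Assumption for this sketch: a centroid with no points keeps its position.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

In a real cluster the map calls would run on separate data splits stored in HDFS and the grouping would be done by the framework's shuffle phase; the driver would repeat the iteration until the centroids stop moving.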
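[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13
Article ID: 2456765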
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2456765.html