Research on Clustering Mining of Logistics Historical Data Based on Hadoop

Published: 2018-04-08 13:14

Topic: Hadoop; Focus: Canopy-Kmeans; Source: Xi'an Technological University (西安工业大学), 2017 master's thesis

[Abstract]: With the development and application of e-commerce, the Internet of Things, cloud computing, and other new technologies, data growth in the logistics industry is no longer linear and slow; it is massive, complex, real-time, and explosive. Traditional single-machine storage and serial data mining techniques can no longer meet the industry's big data processing needs. Hadoop has become a prominent trend: it is an open-source distributed platform suited to distributed computation over large data sets, and in recent years it has shown distinctive advantages in data mining. K-means is an effective clustering algorithm for big data mining that is simple to implement and easy to use. However, its choice of initial centroids and of the value K is largely blind and unpredictable, which often leaves the clustering result at a local optimum. Its distance calculations also contain a large amount of redundant work, so it converges slowly, clusters with low accuracy, and lacks parallelism and scalability, all of which greatly reduce its running efficiency. To address these shortcomings, this thesis combines the advantages of the "distance triangle inequality principle" and the "min-max principle" and proposes, on the Hadoop cloud computing platform, an improved Canopy-Kmeans algorithm based on a double MapReduce distributed programming model. The correctness of the algorithm is verified on real historical data from Shefa Logistics Company (社发物流公司).

The specific work is as follows. First, the thesis describes the Hadoop ecosystem in detail and analyzes its basic components, building blocks, and working mechanisms; it reviews the standard process of big data mining; and it studies the design and procedure of the traditional K-means algorithm, discussing the strengths and weaknesses of existing research. Second, to improve the selection of K, the traditional Canopy algorithm is improved on the Hadoop platform using the min-max principle. This removes the blindness of manually setting K and the region radii T1 and T2 in the traditional Canopy algorithm, and provides a reliable theoretical basis for the accuracy of the K-means clustering results. Third, to eliminate the large amount of redundant computation in the iterations of traditional K-means, a distance-filtering test based on the triangle inequality is added before each iterative calculation, which effectively reduces the redundant distance computations. In addition, to further improve running efficiency, a weighted clustering criterion function is introduced and a convergence test is added, which improves clustering quality and convergence speed and lowers the misclassification rate of data objects. Finally, the improved Canopy-Kmeans algorithm based on the double MapReduce programming model is designed and implemented.

To verify the feasibility of the design, a Hadoop cluster was set up and a large number of experiments were carried out, using the task of identifying the key customer groups of Shefa Logistics Company as the example. The experimental results show that the proposed parallel algorithm improves significantly in clustering accuracy, speedup, and scalability. It resolves the problems of selecting K and the Canopy centers, avoids redundant distance calculations during iteration, and accelerates the convergence of the original algorithm. The larger the data set and the more nodes in the cluster, the more pronounced the improvement.
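The two ideas named in the abstract, max-min seeding and triangle-inequality filtering, can be illustrated with a small single-machine sketch. This is not the thesis's implementation: it runs no MapReduce jobs, takes `k` as an input instead of deriving it, and omits the weighted criterion function. The function names `dist`, `max_min_seeds`, and `assign_with_pruning` are hypothetical. The pruning rule relies on a standard consequence of the triangle inequality: if d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best), so center c_j cannot be closer to p than the current best center.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_seeds(points, k):
    """Pick k seeds with the max-min rule (illustrative assumption).

    Start from points[0]; then repeatedly add the point whose distance
    to its nearest already-chosen seed is the largest.
    """
    seeds = [points[0]]
    nearest = [dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[idx])
        # Update each point's distance to its nearest seed.
        nearest = [min(n, dist(p, points[idx])) for n, p in zip(nearest, points)]
    return seeds


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center.

    Centers that the triangle inequality proves cannot be closer are
    skipped without computing their distance to the point.
    Returns (labels, number_of_skipped_distance_computations).
    """
    k = len(centers)
    # Center-to-center distances, computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then
            # d(p, c_j) >= d(p, c_best): c_j cannot win, so skip it.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped
```

Calling `assign_with_pruning(points, max_min_seeds(points, k))` yields the same labels as a brute-force nearest-center search, while the second return value counts how many point-to-center distances were never computed. The savings grow when clusters are well separated, which is consistent with the abstract's claim that filtering removes redundant distance calculations.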
[Degree-granting institution]: Xi'an Technological University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

[References]