A Spark-Based Spectral Clustering Algorithm and Its Application to QAR Data

Published: 2018-03-23 15:06

  Topic: Spark  Focus: spectral clustering  Source: Civil Aviation University of China, master's thesis, 2017


[Abstract]: Spectral clustering (SC) is a clustering method based on graph theory that produces better clustering results than traditional clustering algorithms. Because it recasts clustering as an optimal subgraph partitioning problem, it can cluster samples in a sample space of arbitrary shape, and the algorithm converges to the global optimum. The basic idea is to treat the samples as the vertices of a graph and the similarities between samples as its weighted edges, and then to use the Laplacian matrix of this weighted undirected graph to find a partition in which edges inside each subgraph carry large weights while edges between subgraphs carry small weights. Applying spectral clustering to large-scale datasets, however, raises two problems: first, the memory required exceeds the hardware capacity of a single machine; second, the clustering takes too long. How to apply spectral clustering to large-scale datasets is therefore a question worth studying, and spectral clustering built on the distributed platforms Hadoop and Spark has become a feasible way to handle such data. The main work of this thesis is as follows.

First, to address the inability of spectral clustering to handle large-scale data, a Spark-based spectral clustering solution is proposed. The parallel graph computation of Spark GraphX is used to compute the similarities between samples and, from them, the Laplacian matrix of the graph. A parallelized Lanczos algorithm then reduces the Laplacian matrix to a tridiagonal matrix, the first K eigenvectors of the tridiagonal matrix are computed, and finally a parallelized K-means algorithm clusters the data formed by taking the K eigenvectors as columns.

Second, a Hive-based data warehouse for Quick Access Recorder (QAR) data is built. To show more intuitively how QAR data is organized and stored in the distributed file system, an HDFS visualization system is built on the Hadoop platform. On that basis, the overall architecture and storage structure of the Hive-based QAR data warehouse are described. Experiments show that the data warehouse meets the storage and query requirements for massive QAR data.

Finally, taking the "air turbulence" event as an example, parallel spectral clustering is carried out on real QAR data from an airline. Experiments show that, while the QAR data warehouse above meets the need for fast queries, the spectral clustering algorithm provides effective technical support for QAR data analysis.
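The first three stages of the pipeline described in the abstract (similarity graph, Laplacian matrix, Lanczos reduction to a tridiagonal matrix) can be sketched on a single machine as follows. This is only an illustrative sketch, not the thesis's Spark GraphX implementation: the Gaussian similarity kernel, the unnormalized Laplacian L = D - W, the full reorthogonalization step, and all function names (`gaussian_similarity`, `laplacian`, `lanczos`) are assumptions, since the abstract does not specify them.

```python
import math
import random


def gaussian_similarity(points, sigma=1.0):
    """Dense similarity matrix W (assumed Gaussian kernel), zero diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def laplacian(w):
    """Unnormalized graph Laplacian L = D - W (an assumed choice)."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(a, v):
    return [dot(row, v) for row in a]


def lanczos(a, m, seed=0):
    """Run up to m Lanczos steps on a symmetric matrix a.

    Returns (alpha, beta, q): the diagonal and off-diagonal of the
    tridiagonal matrix T, and the orthonormal Lanczos vectors q.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(a))]
    norm = math.sqrt(dot(v, v))
    q = [[x / norm for x in v]]
    alpha, beta = [], []
    for j in range(m):
        w = matvec(a, q[j])
        alpha.append(dot(w, q[j]))
        # Full reorthogonalization (done twice for numerical stability);
        # this also removes the alpha_j * q_j and beta_{j-1} * q_{j-1} terms.
        for _ in range(2):
            for qi in q:
                c = dot(w, qi)
                w = [x - c * y for x, y in zip(w, qi)]
        b = math.sqrt(dot(w, w))
        if j == m - 1 or b < 1e-12:  # done, or Krylov space exhausted
            break
        beta.append(b)
        q.append([x / b for x in w])
    return alpha, beta, q
```

Calling `lanczos(laplacian(gaussian_similarity(points)), m)` yields the entries of an m-by-m tridiagonal matrix T. The eigenvectors of T, mapped back through the Lanczos vectors, approximate eigenvectors of the Laplacian; selecting K of them and running K-means on the rows of the resulting matrix would complete the pipeline, and those two steps are omitted here.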
[Degree-granting institution]: Civil Aviation University of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13



