A Parallelized Improvement of a Clustering Algorithm and Its Application to Microblog User Clustering

Published: 2018-04-12 19:40

Topics: clustering algorithm + parallelization; Source: Shanghai Jiao Tong University, master's thesis, 2014

【Abstract】: Cluster analysis is an important technique in data mining. K-means is one of the most widely used clustering algorithms, with applications in computer vision, text mining, customer analysis, and other fields. It is simple and efficient, but it is sensitive to the initial cluster centers and requires the number of clusters K to be specified manually. Agglomerative fuzzy K-means is an improved variant of K-means that is less affected by the initial points and can search for the number of clusters automatically through an agglomerative process. Its drawback is that it requires too many iterations.
This thesis first proposes an improved agglomerative fuzzy K-means algorithm to address that drawback. The improved algorithm replaces the random initial-value selection used by agglomerative fuzzy K-means with an initial center selection method, which reduces the number of iterations required. It also adopts a distributed implementation based on the MapReduce framework, which increases its ability to process large data sets, and is implemented in the Hadoop and Mahout environment. The thesis then studies the methods and problems involved in clustering microblog users, and introduces a Wikipedia-based topic analysis method for microblog text to extract user features. Finally, the improved algorithm is applied to cluster microblog users and the clustering results are analyzed. Experimental results show that the improved algorithm reduces the number of iterations required at run time and scales well on a cluster. Analysis of the microblog user clustering results shows that the algorithm can produce suitable user clusters.
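To make the MapReduce formulation mentioned in the abstract more concrete, the following is a minimal sketch of one way a K-means iteration can be split into a map step and a reduce step. It is an illustration only, under several assumptions: it uses plain (hard) K-means rather than the thesis's agglomerative fuzzy K-means, it runs in a single Python process rather than on Hadoop or Mahout, and the function names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical and do not come from the thesis.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centers):
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs, centers):
    """Reduce step: average the points assigned to each center."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = []
    for i, old in enumerate(centers):
        members = groups.get(i)
        if not members:
            new_centers.append(old)  # keep a center that received no points
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat map + reduce until the centers stop moving.

    Returns (final centers, number of iterations used).
    """
    for it in range(1, max_iter + 1):
        updated = reduce_phase(map_phase(points, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift <= tol:
            return centers, it
    return centers, max_iter


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(kmeans(data, [(0, 0), (10, 0)]))
```

The second value returned by `kmeans` is the iteration count, which is the quantity the abstract says a better choice of initial centers can reduce. In a real distributed setting each map task would see only one partition of the data and the reducers would combine partial sums, which this single-process sketch does not model.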

【Degree-granting institution】: Shanghai Jiao Tong University
【Degree level】: Master's
【Year degree granted】: 2014
【Classification number】: TP393.092; TP311.13
