基于Hadoop的社区发现算法并行化研究

发布时间：2018-04-22 08:04

本文选题：社区发现 + Hadoop　；参考：《江西理工大学》2017年硕士论文

【摘要】：社区发现算法是研究复杂网络中社区结构的主要方法。随着网络规模爆发式的增长,传统单机、串行的社区发现算法已不适用于处理当前大规模的网络。Hadoop作为新兴的一种大数据处理技术,因其高扩展、高可靠、编程模型简单受到许多开发者的青睐。针对当前串行社区发现算法处理网络规模有限问题,本文结合Hadoop框架在大数据处理方面的优势提出两种串行社区发现算法并行化改造方案。针对Fast-Newman算法计算节点模块度复杂度比较高问题,本文提出基于Hadoop框架调度的Fast-Newman并行算法。并行化Fast-Newman算法存在的主要难点为在map函数中各节点无法获取其邻居节点的信息,故还需借助Web服务提供全局图信息。改进后的Fast-Newman算法将在map函数中并行的计算每个节点与其邻居节点合并后的模块度增量,在reduce函数中汇总找出模块度增量最大的两个节点并将该两节点合并,此为一次合并过程。合并后的结果重新作为map函数的输入,迭代执行map、reduce过程直到所有节点合并成一个社区。采用Ego-Facebook作为数据集,在仿真环境下实验结果表明并行化后的Fast-Newman算法具有较高的加速比。针对处理大规模网络难问题,本文提出基于Hadoop的Fast-Unfolding并行化算法PFU(Parallel-Fast-Unfolding)。该算法主要采用“分而治之”的思想,首先将大规模网络分区并各自合并,然后根据各分区合并结果重构网络,最后迭代合并重构网络直到社区结构不再发生变化。该并行化方案存在两个难点:一是如何保证分区后边连接信息不会丢失;二是分区完成后如何重构网络。针对上述两个难点,本文通过改进数据存储方式以及设计重构方案有效地解决了该问题。在真实网络和生成网络两种数据集上实验结果表明,PFU算法在保证准确率的基础上明显的提高了算法运行的效率,具有较好的扩展性。最后,根据map、reduce阶段输出的中间文件,用gephi软件对结果进行可视化提高了PFU算法的应用价值。
[Abstract]:Community discovery algorithm is the main method to study the community structure in complex networks. With the explosive growth of network scale, the traditional single-machine, serial community discovery algorithm is no longer suitable to deal with the current large-scale network. Hadoop as a new big data processing technology, because of its high expansion, high reliability, Simple programming models are popular with many developers. In view of the limited network size of the current serial community discovery algorithm, this paper proposes two parallel transformation schemes of serial community discovery algorithm based on the advantages of Hadoop framework in big data processing. Aiming at the high complexity of computing node modules in Fast-Newman algorithm, a Fast-Newman parallel algorithm based on Hadoop framework scheduling is proposed in this paper. The main difficulty of parallelization Fast-Newman algorithm is that each node can not get the information of its neighbor nodes in the map function, so it is necessary to provide global graph information with the help of Web service. The improved Fast-Newman algorithm will calculate the modular degree increment of each node and its neighbor nodes in parallel in map function, and find out the two nodes with the largest module degree increment in the reduce function and merge the two nodes together, which is a merging process. The merged results are reused as input to the map function, iterating through the map-reduce process until all nodes are merged into one community. Using Ego-Facebook as data set, the simulation results show that the parallel Fast-Newman algorithm has a high speedup. In order to deal with the problem of large scale network, this paper proposes a parallel Fast-Unfolding algorithm based on Hadoop, PFUU Parallel-Fast-Unfolding algorithm (PFU / Parallel-Fast-Unfolding). The algorithm mainly adopts the idea of "divide and conquer". Firstly, the large-scale network is partitioned and merged separately, then the network is reconstructed according to the results of each partition merging. Finally, the network is merged and reconstructed iteratively until the community structure is no longer changed. There are two difficulties in the parallelization scheme: one is how to ensure that the connection information behind the partition will not be lost; the other is how to reconstruct the network after the partition is completed. In view of the above two difficulties, this paper solves the problem effectively by improving the data storage method and designing the refactoring scheme. Experimental results on two data sets of real network and generated network show that the PFU algorithm can obviously improve the efficiency and scalability of the algorithm on the basis of ensuring the accuracy. Finally, according to the intermediate file output from map reduce phase, the application value of PFU algorithm is improved by using gephi software to visualize the result.
【学位授予单位】：江西理工大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP311.13;O157.5

【参考文献】