Research on Clustering of Network Big Data Based on Complex Networks

Published: 2018-03-15 11:31

Topic: big data; Focus: complex networks; Source: Lanzhou Jiaotong University, master's thesis, 2017; Document type: degree thesis

[Abstract]: With the rapid development of communication and information technology, networks keep growing in scale and structural complexity and generate massive amounts of data, known as big data. The emergence of big data has moved human society from the information age into the big data era, in which network data exhibit complexity, diversity, and heterogeneity. In real networks, community structure (also called the clustering property) is an important feature of complex-network big data: connections within a community are relatively dense, while connections between communities are relatively sparse. Community structure is the key to and the foundation of analyzing network big data, and it has significant research value and scientific importance. Community detection has become one of the most challenging research topics in data mining and many other fields. This thesis studies community detection algorithms for homogeneous and heterogeneous networks, and its main contributions are as follows.
(1) To effectively mine overlapping community structure in complex networks, the thesis proposes an overlapping community detection algorithm based on the connection similarity of maximal cliques. The algorithm uses maximal cliques to initialize the community structure of the network, quantifies the connectivity between communities from the neighbor nodes shared by cliques and the bridging edges between cliques, and merges communities on that basis to obtain a reasonable overlapping community structure. The algorithm is compared with the classical CPM algorithm on four real networks. The experimental results show that the community structure it obtains improves in precision, coverage, and modularity, which indicates that the overlapping community structure it finds is reasonable.
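The clique-based procedure in (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's similarity formula or merge rule, so `connection_similarity`, the `threshold` value, and the toy edge list below are assumptions made for illustration only: cliques are enumerated with the standard Bron-Kerbosch algorithm, and two communities are merged when a stand-in score built from shared nodes and bridging edges is high enough.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected graph as {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques with Bron-Kerbosch and pivoting."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(set(r))
            return
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def connection_similarity(a, b, adj):
    """Hypothetical score in [0, 2]: node overlap plus bridging-edge density.

    This is a stand-in, not the measure defined in the thesis.
    """
    overlap = len(a & b) / min(len(a), len(b))
    only_a, only_b = a - b, b - a
    possible = len(only_a) * len(only_b)
    bridges = sum(1 for u in only_a for v in only_b if v in adj[u])
    return overlap + (bridges / possible if possible else 0.0)


def merge_similar(communities, adj, threshold):
    """Greedily merge any pair whose similarity reaches the threshold."""
    comms = [set(c) for c in communities]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(comms)), 2):
            if connection_similarity(comms[i], comms[j], adj) >= threshold:
                comms[i] |= comms[j]
                del comms[j]
                merged = True
                break
    return comms


def overlapping_communities(edges, threshold=0.5):
    adj = build_adjacency(edges)
    return merge_similar(maximal_cliques(adj), adj, threshold)


# Toy graph (assumed): two 4-cliques sharing node 3, plus node 7 tied to 0, 1, 2.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (7, 0), (7, 1), (7, 2),
         (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
communities = overlapping_communities(edges)
print(sorted(sorted(c) for c in communities))
```

On this toy graph the two cliques that share nodes 0, 1, and 2 are merged, while node 3 remains a member of both resulting communities, which is the kind of overlap the algorithm is meant to capture.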
(2) Traditional community detection algorithms for homogeneous networks cannot make full use of heterogeneous information. To address this, the thesis proposes a heterogeneous network community detection algorithm based on semantic paths, which takes full account of the information carried by heterogeneous nodes and edges. The algorithm first selects semantic paths with the FindPath method, then extracts the object similarity matrix under each semantic path, and finally extracts and fuses the object features under the different semantic paths and applies the K-Means algorithm to obtain the final community partition. Experiments on real data sets demonstrate the effectiveness of the algorithm.
(3) Existing community detection algorithms for heterogeneous networks cannot fully preserve the original structure and information of the network, and they seldom consider the case in which nodes of different types belong to the same community. To address this, the thesis proposes a heterogeneous network community detection algorithm based on maximal bicliques (bipartite maximal cliques). First, the largest maximal biclique containing a key node is taken as an initial community; then, the community is expanded according to a quantified similarity between the community and its neighbor nodes; finally, a reasonable community structure is obtained. Comparative experiments on synthetic and real heterogeneous networks show that the communities produced by the algorithm achieve relatively high accuracy and modularity, which indicates that the algorithm can effectively detect community structure in heterogeneous networks.
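The semantic-path pipeline in (2) can be sketched roughly as follows. The abstract does not specify the similarity measure, the fusion rule, or how FindPath chooses paths, so this example makes several assumptions: the two paths (author-paper-author and author-paper-venue-paper-author) are fixed by hand on a hypothetical toy bibliographic network, PathSim is swapped in as the per-path similarity, features are fused by simple concatenation, and a small deterministic K-Means stands in for a library implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def pathsim(commuting):
    """PathSim: s(i, j) = 2 * M[i][j] / (M[i][i] + M[j][j])."""
    n = len(commuting)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = commuting[i][i] + commuting[j][j]
            sim[i][j] = 2.0 * commuting[i][j] / denom if denom else 0.0
    return sim


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-first seeding."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(
            points, key=lambda p: min(dist2(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy network: 4 authors, 4 papers, 2 venues.
author_paper = [[1, 1, 0, 0],   # author 0 wrote papers 0 and 1
                [0, 1, 0, 0],   # author 1 wrote paper 1
                [0, 0, 1, 1],   # author 2 wrote papers 2 and 3
                [0, 0, 0, 1]]   # author 3 wrote paper 3
paper_venue = [[1, 0], [1, 0], [0, 1], [0, 1]]

# Commuting matrices for the two hand-picked semantic paths.
apa = matmul(author_paper, transpose(author_paper))
author_venue = matmul(author_paper, paper_venue)
apvpa = matmul(author_venue, transpose(author_venue))

# One similarity matrix per path, fused by concatenating each author's rows.
sim_apa, sim_apvpa = pathsim(apa), pathsim(apvpa)
features = [row_a + row_b for row_a, row_b in zip(sim_apa, sim_apvpa)]

labels = kmeans(features, k=2)
print(labels)
```

Concatenation weights every path equally; a weighted fusion would be a natural variation if some semantic paths are known to be more informative than others.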
[Degree-granting institution]: Lanzhou Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13;O157.5

[References]