基于标签传播算法的社区发现新算法

发布时间：2018-08-12 19:39

【摘要】：互联网近年来的飞速发展,造就了一批以社交为主的网站,其中国外的Facebook,Twitter,Google+,国内的QQ空间,豆瓣,人人等最为流行。这些社交网站每天都会有大量的用户使用,并且产生大量的分享数据,建立新的朋友关系等。对于这些数据来说,具有很高的利用价值,比如网络营销,舆情分析等。因此对这些数据的处理方式尤为重要,其中社区发现是研究热点。本文主要是针对目前常用的社区发现算法进行改进。最常用的标签传播算法LPA的思想是在初始情况下,网络中的每个节点都被初始化为唯一的标签,在迭代更新每个节点的标签时,节点是根据其邻居节点中标签个数最多的作为更新标签,如果标签个数最多的标签并不是唯一的,那么随机从中选择一个标签来更新当前节点,最终达到收敛或者震荡,该算法停止。由于该算法的思想和实现过程导致该算法有一些缺点,例如不稳定性,发现的社区要么是巨型社区,要么就是无意义的小型社区,分布极其不均匀并且对网络的结构比较敏感,二分网络情况下会发生循环震荡。针对上述LPA算法的一系列缺点,本文提出了LAAPA(label-attributeattenuation progagation algorithm),基于标签属性和衰减因素的社区发现算法。该算法引人了传播衰减因子和节点属性,其中传播衰减因子顾名思义就是节点标签的传播距离是有限的,并且随着距离的增加节点标签的影响力逐渐降低,更新标签的权重也随之减少;节点属性是指在社交网络中节点之间的关系并不仅仅是关注与被关注这种简单的传统上的“边”,为了与实际情况更加吻合本文提出了节点属性,具体是指将节点的其他属性比如豆瓣网中用户加入的“小组”作为节点的属性,节点之间的相同属性将会反映到节点之间的边上,使用权值来表示。在算法迭代中,节点更新标签时,将考虑邻居节点的标签传播距离,节点之间边的权值,节点的度等因素。本文通过两组标准数据对提出的LAAPA算法和LPA算法进行比较,在社区大小,模块度等方面LAAPA算法发现的社区比LPA算法效果好。在使用Scrapy抓取的豆瓣网数据中,经过清洗格式化后验证了该数据符合社交网络的特性,“小世界”,节点中心性等。利用两种社区发现算法进行社区发现,进行对比,结果也是LAAPA算法发现的社区质量比LPA算法质量高,并且稳定,社区分布均匀,没有巨型社区的出现。
[Abstract]:The rapid development of the Internet in recent years has created a number of social-oriented websites, including Facebook Twitter Google, domestic QZone, Douban, everyone and so on the most popular. These social networking sites are used by a large number of users every day and generate a lot of data sharing, new friendships and so on. For these data, it has a high value, such as network marketing, public opinion analysis and so on. Therefore, the processing of these data is particularly important, among which community discovery is a hot topic. This paper is mainly aimed at the improvement of community discovery algorithm commonly used at present. The idea of LPA, the most commonly used tag propagation algorithm, is that, initially, each node in the network is initialized as a unique tag, and each node's label is updated iteratively. The node updates the label according to the number of tags in the neighbor node. If the label with the largest number of tags is not unique, then randomly select a label from which to update the current node, and finally achieve convergence or oscillation. The algorithm stops. As a result of the thought and implementation process of the algorithm, the algorithm has some disadvantages, such as instability, the communities found are either giant communities or meaningless small communities, which are extremely unevenly distributed and sensitive to the structure of the network. In the case of binary networks, cyclic oscillations occur. In view of a series of shortcomings of the above LPA algorithm, this paper proposes a community discovery algorithm based on label attributes and attenuation factors for label-attributeattenuation progagation algorithm), (label-attributeattenuation progagation algorithm),). The algorithm induces propagation attenuation factor and node attribute, in which the propagation attenuation factor is, as the name implies, that the propagation distance of the node label is limited, and the influence of the node label decreases gradually with the increase of the distance. The weight of updating tags is also reduced; the node attribute refers to the relationship between nodes in the social network is not only concerned with and be concerned about this simple traditional "edge", in order to more consistent with the actual situation, this paper proposed node attributes. It refers to the other attributes of nodes, such as the "group" added by users in the Douban network, and the same attributes between nodes will be reflected to the edge of the nodes, and the value of the right to use will be used to express them. In the iteration of the algorithm, the label propagation distance of the neighbor node, the weight of the edge between nodes, the degree of the node and so on will be taken into account when the node updates the label. This paper compares the proposed LAAPA algorithm with LPA algorithm through two sets of standard data. The community found by LAAPA algorithm is better than that of LPA algorithm in community size, module degree and so on. In the data collected by Scrapy, it is verified that the data conforms to the characteristics of social network, "small world", node centrality and so on. Compared with the two community discovery algorithms, the results show that the community quality of LAAPA algorithm is higher than that of LPA algorithm, and the quality of community is stable, the community distribution is even, and there is no giant community.
【学位授予单位】：西安电子科技大学
【学位级别】：硕士
【学位授予年份】：2015
【分类号】：TP301.6

【参考文献】