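Research on Key Problems in User Sampling, Measurement, and Evaluation for Large Online Social Networks (OSNs)
Published: 2018-05-20 23:22
Topics: social network + core network; Source: Beijing University of Posts and Telecommunications, 2014 master's thesis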
[Abstract]: Social network data has two defining characteristics. First, the volume is enormous: popular social platforms in China and abroad each have more than one hundred million users, and the edges between those users are more numerous still, so analyzing the network in its entirety is impractical. Second, the network structure is complex: every relationship in the network is organized by the users themselves, and the network contains multi-level entity relationships whose internal coherence is hard for current sampling-based research methods to recover and handle. In this sense, the multi-level entity relationships inside a social network are the key problem affecting the sampling, measurement, and evaluation of social network users. This thesis undertakes an exploratory study of these relationships in order to improve all three.
This thesis focuses on asymmetric relationships (multi-level entity relationships) in large online social networks. The approach is to obtain the hierarchical structure of the social network through sampling, then measure the attribute characteristics of the hierarchical network, and finally evaluate the measurement results. The asymmetry of a social network shows up mainly as asymmetry among nodes, that is, their multi-level nature, and comprises the imbalance of user influence and the asymmetry of edges. Nodes that hold a dominant position in the social network are called core nodes (the "star users"), and nodes in a weaker position are called peripheral nodes. From the perspective of node hierarchy, the thesis divides a social network into three parts: the core network, the peripheral network, and the core-periphery structure, with the core network as the main object of study.
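The three-part division described above can be sketched in a few lines of Python. This is only an illustration, not the thesis's implementation: the function names `split_core_periphery` and `classify_edges`, the `(follower, followee)` edge format, and the use of a plain follower-count threshold are assumptions made for the example.

```python
from collections import defaultdict


def split_core_periphery(follow_edges, follower_threshold):
    """Split nodes into core and periphery by follower count.

    follow_edges: iterable of (follower, followee) pairs (assumed format).
    A node is treated as core when it has MORE than `follower_threshold`
    followers; the threshold itself is a free parameter of this sketch.
    """
    followers = defaultdict(set)
    nodes = set()
    for src, dst in follow_edges:
        nodes.update((src, dst))
        followers[dst].add(src)
    core = {n for n in nodes if len(followers[n]) > follower_threshold}
    return core, nodes - core


def classify_edges(follow_edges, core):
    """Count edges inside the core, inside the periphery, and between them."""
    counts = {"core": 0, "periphery": 0, "core-periphery": 0}
    labels = {2: "core", 0: "periphery", 1: "core-periphery"}
    for src, dst in follow_edges:
        endpoints_in_core = (src in core) + (dst in core)
        counts[labels[endpoints_in_core]] += 1
    return counts


# Toy follow graph: "s" and "t" play the role of star users.
edges = [("a", "s"), ("b", "s"), ("c", "s"), ("s", "t"),
         ("t", "s"), ("a", "t"), ("b", "t"), ("a", "b")]
core, periphery = split_core_periphery(edges, follower_threshold=2)
print(sorted(core), sorted(periphery), classify_edges(edges, core))
```

Edges with both endpoints in the core form the core network, edges with both endpoints outside it form the peripheral network, and the remaining edges make up the core-periphery structure.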
After discussing the hierarchical nature of social network nodes, Chapter 3 selects Sina Weibo, currently the largest and most influential platform of its kind in China, as the object of study. After the crawled data is cleaned, a network of 35 million Sina Weibo users is constructed. First, statistical analysis yields the network's degree distribution and in-degree/out-degree ratio; the degree distribution of Sina Weibo follows a typical power law. Next, the core network formed by core users (defined as users with more than 5,000 followers) is extracted from the 35-million-user network, and its properties are analyzed in terms of degree distribution, in-degree/out-degree ratio, clustering coefficient, network density, and edge symmetry. Then, to assess how effectively different sampling methods detect the core network, the thesis compares snowball sampling with random walk sampling. The core networks obtained by the two methods have similar network densities, and both are sparser than the real network. When the sampling ratio is low, however, the core network obtained by snowball sampling scores better than that of random walk on all three topological measures (network density, edge symmetry, and clustering coefficient) and is closer to the real network, so the thesis concludes that snowball sampling has the advantage for studying core networks. Finally, building on this work, three experiments examine how the effectiveness of snowball sampling at detecting the core network varies with the number of seeds, the sampling depth, and the sampling ratio. The randomness of the seeds affects the convergence speed and the sampling bias; the sampling ratio controls the expansion speed; and the sampling depth is in effect determined by the first two factors, although where it can be controlled it may be adjusted to suit the network being crawled. To the author's knowledge, this thesis is the first to focus on detecting the core of a social network and to treat that core as a characteristic of the network. It also analyzes the coverage and degree distribution of core nodes and the density of core users' follower networks, which capture a large number of core-network features.
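The comparison of snowball sampling and random walk described above can be illustrated with a small, standard-library-only sketch. It is not the thesis's code. In particular, the sampling ratio is interpreted here as the fraction of each node's neighbors that is followed, which is an assumption; the restart rule at dead ends, the function names, and the toy graph are likewise hypothetical. The clustering coefficient is left out to keep the example short.

```python
import math
import random
from collections import deque


def snowball_sample(out_neighbors, seeds, depth, sample_ratio, rng):
    """Breadth-first expansion from the seeds, at most `depth` hops.

    Assumption: `sample_ratio` is the fraction of each node's neighbors
    that gets followed; the thesis may define it differently.
    """
    visited = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, hops = frontier.popleft()
        if hops >= depth:
            continue
        neighbors = sorted(out_neighbors.get(node, ()))
        k = min(len(neighbors), math.ceil(len(neighbors) * sample_ratio))
        for nxt in rng.sample(neighbors, k):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, hops + 1))
    return visited


def random_walk_sample(out_neighbors, start, steps, rng):
    """Simple random walk; jumps back to `start` at a dead end (assumed rule)."""
    visited = {start}
    node = start
    for _ in range(steps):
        neighbors = sorted(out_neighbors.get(node, ()))
        node = rng.choice(neighbors) if neighbors else start
        visited.add(node)
    return visited


def induced_edges(out_neighbors, nodes):
    """Directed edges of the original graph with both endpoints in `nodes`."""
    return {(u, v) for u in nodes
            for v in out_neighbors.get(u, ()) if v in nodes and v != u}


def density(edges, n):
    """Directed network density: edges present over n * (n - 1) possible."""
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def edge_symmetry(edges):
    """Fraction of directed edges whose reverse edge is also present."""
    if not edges:
        return 0.0
    return sum((v, u) in edges for u, v in edges) / len(edges)


# Toy follow graph; `core` stands in for users above the follower threshold.
graph = {"s": ["t", "a"], "t": ["s", "u"], "u": ["s"],
         "a": ["s", "t"], "b": ["s", "u"], "c": ["t"]}
core = {"s", "t", "u"}
rng = random.Random(0)
samples = {
    "snowball": snowball_sample(graph, ["a"], depth=2, sample_ratio=0.5, rng=rng),
    "random walk": random_walk_sample(graph, "a", steps=5, rng=rng),
}
for name, sample in samples.items():
    sampled_core = sample & core
    edges = induced_edges(graph, sampled_core)
    print(name, len(sampled_core), density(edges, len(sampled_core)),
          edge_symmetry(edges))
```

Computing the same two measures on the full core network gives the baseline against which each sampled core can be judged, and varying the seeds, `depth`, and `sample_ratio` mirrors the three follow-up experiments in spirit.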
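[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.0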