大规模社会网络中影响最大化问题高效处理技术研究

发布时间：2019-06-22 16:25

【摘要】：近年来，随着互联网和Web2.0技术的飞速发展，社会网络作为沟通现实人类世界的桥梁，已经成为交互沟通、知识共享和信息传播的重要媒介和平台。其中，影响最大化问题旨在发现社会网络中最有影响力的节点集合，是社会网络分析领域的关键问题，在许多重要场景中有着广泛的应用，例如市场营销、广告发布、舆情预警、水质监测、疫情监控等，因此具有很高的研究价值和应用价值。许多影响最大化应用策略的制定和部署对于算法求解时间十分敏感，因此，高效的求解算法是当前学术界和工业界研究影响最大化算法的核心目标。已有的研究成果主要集中于一些贪心算法和启发式算法，存在求解速度慢、计算效率低的问题。另一方面，当今社会网络的数据规模海量、数据耦合度高、网络结构动态变化。当面对大规模社会网络时，已有算法暴露出许多难以克服的问题：第一，社会网络中节点影响值计算的可并行性问题。已有工作专注于降低算法的复杂度，没有充分利用已有的并行计算架构来加速问题求解。而实际社会网络中存在大量的节点影响值计算可由并行计算架构并发执行。因此，在挖掘算法的并行性方面，影响最大化算法的执行速度仍有很大的提升空间。第二，已有影响最大化算法未充分考虑社会网络节点分布特性。社会网络中节点的度分布服从典型的幂律分布。然而，现有贪心算法大多采用精确计算的方式来计算所有节点的影响值，导致大度数节点的计算复杂度十分高，成为算法执行的瓶颈。第三，社会网络拓扑结构的动态变化问题。现有影响最大化问题研究专注于静态网络；当网络动态变化时，大都需要针对全网进行重新计算节点的影响值，会造成大量冗余计算，导致性能无法满足大规模社会网络的需求。针对上述技术瓶颈，本文系统地研究了社会网络影响最大化问题的高效处理技术，从以下几个方面展开研究：针对现有方法并行度差、算法复杂度高，从而导致运行时间过长的问题，本文基于CPU+GPU的异构并行计算框架，设计和实现了一种具有高并行度的影响最大化算法BUTA，并针对GPU体系结构做了进一步算法优化。本文通过深入分析社会网络中节点之间的层次依赖关系，发现了节点影响值计算的可并行性。在此基础上，设计了一种自底向上的逐层扫描方法BUTA。BUTA算法一方面可以在保证算法精度的同时大幅度降低算法复杂度，另一方面BUTA充分利用了节点的层次分布，以高并行度计算节点的影响值。为了使BUTA算法更加适配CPU+GPU的异构并行计算框架，本文设计了三种优化方法：K层合并、数据重组和合并访存，分别用于降低运行时分支，减少访存次数和提高算法并行度。针对已有影响最大化算法未充分利用社会网络节点分布特性的问题，本文提出了一种基于蒙特卡洛理论的采样估计算法ESMCE，大幅度提升了计算效率。本文对社会网络中节点的分布特性进行了建模和挖掘；针对大度数节点计算时间长的问题，本文引入蒙特卡洛理论，设计了一种节点影响值估计方法ESMCE。在采样过程中，ESMCE算法设计了一种由幂律指数指导的采样节点个数计算方法。之后，根据估计误差同精度要求之间的差距，，本文提出了一种基于灰度预测模型的后续采样节点个数预测方法，以通过多次迭代采样来提高算法精度直至采样误差满足设定的精度要求。针对社会网络拓扑动态变化造成的已有算法计算效率低的问题，本文设计了一种增量式的影响最大化算法IncInf。本文深入分析了社会网络拓扑结构的演化特征，发现社会网络的拓扑变化满足优先连接原则，同时最有影响力节点的度数要明显大于普通节点。基于上述发现，本文设计了一种基于局部化理论的影响变化量高效计算方法。基于节点的影响变化量和原有网络对应的最有影响力节点信息，设计了一种剪枝策略，将候选节点范围有效缩小到影响值增长迅速、度排序靠前的节点集合，从而大幅度降低了动态社会网络影响最大化求解的复杂度，减少了程序运行时间。针对当前内容分发方法忽略了社会网络中的用户关联关系、地理位置等社会信息，从而导致用户访问延迟高的问题，本文设计了一种基于影响最大化的内容分发方法SCORE。同已有的内容分发方法不同，SCORE方法充分利用了社会网络中的用户信息，提出了一种基于影响最大化算法的缓存内容选择策略以快速准确地定位未来访问频率较高的关键内容。为了最小化访问延迟，SCORE方法通过挖掘用户之间的关联关系和地理位置信息，设计了一种基于K-MEANS聚类算法和加权球面平均计算方法的边缘服务器选择策略，从而将关键内容预先分发到离潜在访问用户最近的边缘服务器，以便于就近响应用户请求。实验结果表明，SCORE方法可以大幅度降低用户访问延迟，提升用户体验质量。综上所述，本文针对社会网络影响最大化问题的高效处理技术提出了有效的解决方案，并通过在真实数据集上进行实验验证了所提算法的有效性，对于推进社会网络影响最大化问题的研究和实用化具有一定的理论意义和应用价值。
[Abstract]:In recent years, with the rapid development of the Internet and the Web2.0 technology, the social network, as a bridge for communication with the real human world, has become an important medium and platform for interactive communication, knowledge sharing and information dissemination. Among them, the purpose of maximizing the problem is to find the most influential set of nodes in the social network, which is the key problem in the field of social network analysis, and has wide application in many important scenarios, such as marketing, advertisement release, public opinion warning, water quality monitoring, epidemic monitoring and the like. And therefore has high research value and application value. The development and deployment of many impact-maximizing application strategies are very sensitive to the algorithm's solution time, so the efficient solution algorithm is the core of the current research and influence maximizing algorithm in the academic and industry. The existing research results are mainly focused on some greedy algorithms and heuristic algorithms. On the other hand, the data of the present social network is massive, the data coupling degree is high, the network structure changes dynamically, In the face of the large-scale social network, the existing algorithms have exposed many problems which are difficult to overcome: the parallelizable question of the calculation of the influence value of the nodes in the first and the social networks The existing work is focused on reducing the complexity of the algorithm, and the existing parallel computing architecture is not fully utilized to speed up the problem. A large number of node-affected values in the actual social network can be executed by the parallel computing architecture. So, in the aspect of the parallelism of the mining algorithm, the execution speed of the influence maximizing algorithm is still greatly improved. the second, the existing effect maximization algorithm does not take full account of the social network node distribution The distribution of the nodes in the social network obeys the typical power law. However, most of the existing greedy algorithms are used to calculate the influence value of all the nodes in a precise way, which leads to the high computational complexity of the large-degree nodes and becomes a bottle to be executed by the algorithm. The Dynamic Change of the Structure of the Third and the Social Network Problem: The existing research on maximizing the problem is focused on the static network; when the network changes dynamically, it is necessary to re-compute the influence value of the node for the whole network, resulting in a large amount of redundant calculation, which can lead to the performance of the large-scale social network In view of the above-mentioned technical bottleneck, this paper systematically studies the high-efficiency processing technology of the problem of maximizing the influence of social network, from the following aspects: Open study: The algorithm complexity is high for the existing method, resulting in the running time In this paper, based on the heterogeneous parallel computing framework of CPU and GPU, this paper designs and implements a high-degree-of-parallelism impact-maximizing algorithm, BUTA, and advances in to the GPU architecture. In this paper, the hierarchical dependency of nodes in the social network is analyzed, and the calculation of the influence of the nodes is found. On the basis of this, a self-base layer-by-layer scanning method, BUTA. BUTA, is designed. On the one hand, the algorithm complexity can be greatly reduced while the accuracy of the algorithm is guaranteed. On the other hand, BUTA makes full use of the hierarchical distribution of the nodes and calculates the nodes with high degree of parallelism. In order to make the BUTA algorithm more suitable for the heterogeneous parallel computing framework of the CPU + GPU, the paper designs three optimization methods: K-layer combination, data recombination and combined access, which are used to reduce the running-time branch, reduce the number of visits and the increase. This paper presents a sampling estimation algorithm based on Monte-Carlo theory, which is based on the Monte-Carlo theory, which has not fully utilized the distribution characteristics of the social network nodes. The calculation efficiency is improved. The distribution characteristics of the nodes in the social network are modeled and mined, and the problem of the long-time calculation time of the large-degree nodes is solved. In this paper, the Monte-Carlo theory is introduced, and a kind of node influence value estimation is designed. Method ESMCE. In the process of sampling, the ESMCE algorithm designs a sampling section that is guided by the power law index. In this paper, based on the difference between the estimation error and the precision requirement, a method for predicting the number of subsequent sampling nodes based on the gray-scale prediction model is presented in this paper to improve the accuracy of the algorithm until the sampling error is satisfied by multi-iterative sampling. In order to solve the problem of the low efficiency of the existing algorithm caused by the dynamic change of the social network topology, this paper designs an incremental influence to the maximum. In this paper, the evolution of the social network topology is deeply analyzed, and the topology change of the social network is found to meet the principle of priority and the degree of most influential nodes It is obviously larger than the common node. Based on the above-mentioned findings, this paper designs a kind of effect based on the local theory the invention designs a pruning strategy based on the influence change amount of the node and the most influential node information corresponding to the original network, the pre-ordering node set is sorted, so that the complexity of the dynamic social network influence maximization solution is greatly reduced, The time of program execution is reduced. In view of the fact that the current content distribution method ignores the social information such as the user-related relation and the geographical position in the social network, the problem that the user's access delay is high is caused, and a kind of problem based on the influence is designed in this paper. The content distribution method SCORE is different from the existing content distribution method. The SCORE method makes full use of the user information in the social network, and proposes a cache content selection strategy based on the influence maximization algorithm to quickly and accurately locate the future. In order to minimize the access delay, the SCORE method designs a K-MEANS clustering algorithm and a weighted spherical average computing party based on the association relationship and the geographic location information between the users in order to minimize the access delay the edge server selection policy of the method thereby pre-distributing the key content to the edge server closest to the potential access user, so as to respond to the user's request nearby. The experiment results show that the SCORE method can greatly reduce the user access To sum up, this paper puts forward a valid solution for the efficient processing technology of the problem of maximizing the influence of social network, and it is made on the real data set In this paper, the effectiveness of the proposed algorithm is verified, and the research and practical tools for maximizing the influence of the social network
【学位授予单位】：国防科学技术大学
【学位级别】：博士
【学位授予年份】：2013
【分类号】：TP393.02;TP301.6

【参考文献】