基于节点相似度和链接次数组合时序的链接预测算法

发布时间：2018-04-23 03:37

本文选题：链接预测 + 节点相似度　；参考：《吉林大学》2017年硕士论文

【摘要】：在网络中,节点表示实体,链接表示它们之间的关系。随着越来越多真实网络数据的获得,通过对网络的分析来挖掘一些有价值的规律成为研究热点。作为链接挖掘最重要的问题之一,链接预测,即根据观察到的节点和链接信息,来估计两个节点之间存在链接的可能性。在众多应用的推动下,链接预测的研究取得了丰硕的成果。目前采用较为广泛的是基于节点相似度的方法,通过相似性分数的大小,预测产生链接的可能性。相似度的计算主要包括基于网络拓扑的方法与基于节点属性的方法,此外社区信息也被证明有助于链接预测。上述静态链接预测方法曾在某些领域取得了不错的效果,但是在现实世界中,网络往往是动态变化的。静态方法里,网络随时间的变化被忽视,如果只采用最近一个时间快照下的网络图,网络变化比较频繁时,预测效果就会急剧下降;如果把历史上各时间快照下的网络图叠加,则不能用于链接重复发生的情况,如电话、邮件等通信链接。随着互联网的发展,链接重复发生的场景越来越广泛,网络的演化越来越普遍,静态链接预测方法已远远不能适应新形势下的需求,因此,近些年时间信息逐渐得到重视。目前,针对链接预测问题的研究,主要有两个方向,一个方向是继续完善静态方法,充分提取当前观察到的有用网络信息,包括拓扑信息、属性信息、社区信息等;另一个方向是给空间结构加上时间维度,考虑如何利用网络随时间的变化,更好地完成预测。时间序列在描述时间信息上取得了较好的效果,将历史上各时间段的网络信息表示为离散的时间序列图,并进行链接预测,主要有两种方式。一种是节点间链接次数的时间序列,仅仅根据节点间过去的链接次数预测未来的链接情况,取得了与静态方法类似的结果,将其与静态方法相结合,能进一步提高预测效果。这种方法的优势在于,考虑链接历史上出现的次数而不是是否出现,时间序列模型较好地利用了链接的变化情况及最近时间的链接信息。同时,混合模型将静态方法预测新链接的能力与时间序列预测重复链接的能力结合起来,是一种较为全面的方法。这种方法存在的问题是,对于新链接,由于失去了链接次数时间序列,混合模型就降级成了静态相似度方法;此外,混合模型将最终的静态方法预测值与时间序列预测值相乘,难以描述每个时间段的网络信息。另一种时间序列方法做出了改进,采用节点相似性分数的时间序列,根据节点间历史上各时间段的相似性分数,预测未来的相似性,从而预测链接情况。这种方法也尝试将节点间通过整个网络计算的相似性分数与节点间真正发生的链接次数结合,混合模型将每个时间段的相似性分数与链接次数归一化后相加,以此作为时间序列的输入,然后以一元时间序列预测值作为未来链接发生的分数。但是,由于模型过于简单未能描述相似性分数与链接次数的关系,两者的变化规律不同,混合模型得到的结果反而不如仅仅采用相似性分数时间序列。针对以上不足,本文提出了一种新的基于节点相似度和链接次数组合时序的链接预测方法SOTS(Similarities and Occurrences Time Series)。首先通过有趋向的随机游走,计算历史各时间段节点间的相似性分数,然后采用时间序列模型将其与各时间段节点间的实际链接次数组合起来,预测下个时间段各节点对发生链接的可能性。通过两种组合时间序列模型,本文研究了节点间相似性分数与实际链接次数的关系。该方法能够用于演化网络中未来新的链接以及重复出现链接的预测。本文贡献如下:(1)采用一种新的方法将属性社区与网络拓扑组合起来,计算相似性分数。(2)研究了链接的形成与相似性分数的关系。(3)将相似性分数与链接次数有机结合,充分提取每个时间段的信息。尤其是二元时间序列模型的结合方式,有效描述了二者随时间的协同演化。通过对时间序列与静态信息的分析,我们将网络结构的时间、空间两个维度结合起来。该方法比传统方法利用的网络信息更加全面,模型更加有效,经过详实的实验,本文在中文DBLP数据集上评价了该方法与前人方法的预测效果,实验证明,该方法提高了约15%的预测准确度,达到了本文工作的预期目标。
[Abstract]:In the network, nodes represent entities and links represent their relationship. With the acquisition of more and more real network data, mining some valuable rules through the analysis of the network becomes a hot topic. As one of the most important issues of link mining, link prediction, which is estimated by the observed nodes and link information, is estimated to be two The possibility of links between nodes. With the promotion of many applications, the research of link prediction has achieved fruitful results. At present, the widely used method based on node similarity is to predict the possibility of linking by the size of similarity scores. The calculation of similarity degree mainly includes the method based on network topology and the method of similarity calculation. In addition, community information is also proved to be helpful to link prediction. The above static link prediction method has achieved good results in some fields, but in the real world, the network is often dynamic. In static methods, the network is neglected with time, if only the last time snapshot is used. If the network changes are more frequent and the network changes are more frequent, the prediction effect will fall sharply. If the network graph under the fast photos of all the times in history is superimposed, it can not be used for the repetition of links, such as telephone, mail and other communication links. With the development of the Internet, the scenes of heavy links are becoming more and more extensive and the evolution of the network becomes more and more more and more. Generally, the static link prediction method has been far from adapting to the needs of the new situation. Therefore, the time information has been paid more attention in recent years. At present, there are two main directions for the study of link prediction. One direction is to continue to improve the static method, to fully extract useful network information, including topological information, and attributes. Information, community information and so on; the other is to add time dimension to space structure, consider how to make use of the time changes of the network to better complete the prediction. The time series has achieved good results in describing time information, and the network information table of each time in history is shown as a discrete time series graph, and the link is predicted. There are two ways. One is the time sequence of the number of links between nodes. It is only based on the number of links between the nodes to predict the future link. The result is similar to the static method. The combination of the static method and the static method can further improve the prediction effect. The advantage of this method is to consider the times in the link history. It is a more comprehensive method to combine the ability of the static method to predict the ability of the new link and the ability of the time series to predict repeated links. The problem of this method is that it is new to the new model. Because of the loss of the time series of link times, the hybrid model is degraded into a static similarity method. In addition, the hybrid model multiplies the predicted value of the final static method with the time series prediction value, and it is difficult to describe the network information in each time period. The other time series method has made an improvement and uses the node similarity score. The sequence, according to the similarity score of each time interval between the nodes, predicts the future similarity and predicts the link situation. This method also tries to combine the similarity scores of the whole network with the number of real links between nodes. After normalization, it is used as the input of the time series, and then the prediction value of a single time series is used as a fraction of the future link. However, because the model is too simple to describe the relationship between the similarity scores and the number of links, the change rules of the two are different, and the results obtained by the mixed model are not as good as using the similarity only. In this paper, a new link prediction method SOTS (Similarities and Occurrences Time Series) based on the node similarity and the number of link times is proposed in this paper. First, the similarity scores between the nodes in each period of history are calculated by the random walk with a tendency, and then the time series model is used. By combining the actual number of links between nodes in each time period, the possibility of linking to each node in the next time period is predicted. Through two combination time series model, the relationship between the similarity scores and the number of links between nodes is studied. This method can be used for the future new links and duplication in the evolution network. The contributions of the present link are as follows: (1) a new method is used to combine attribute communities and network topology to calculate similarity scores. (2) the relationship between the formation of links and the similarity scores is studied. (3) the information of each time period is extracted by combining the similarity scores with the number of links, especially the two yuan time series. The combination of the model, effectively describes the co evolution of the two with time. Through the analysis of time series and static information, we combine the time and space two dimensions of the network structure. This method is more comprehensive than the traditional method of network information, and the model is more effective. After a detailed experiment, this paper is in the Chinese DBLP number. According to the set, the prediction results of the method and the predecessor method are evaluated. The experiment proves that the method improves the prediction accuracy of about 15%, and achieves the expected goal of this work.

【学位授予单位】：吉林大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：O211.61

【相似文献】