PageRank算法的改进及在生物网络数据上的应用

发布时间：2018-08-16 11:17

【摘要】：作为世界上最成功的网页排名算法，PageRank算法于1998年由Google的两位创始人Larry Page和Sergey Brin开发，最初被用于Google的搜索引擎。该算法将用户的浏览行为描述为一个Markov随机冲浪模型，为了满足网络的强连通特性，算法中加入阻尼系数d，d的取值范围为[0,1]，因此可将其视为如何选取下一个节点的概率。从概率的角度，PageRank算法可理解为：以网络中的某一节点为初始点，用户以d的概率沿着该节点某一条链出链接继续浏览，以1-d的概率随机浏览整个网络中的某一点。PageRank算法的运算过程须经多次迭代，由于算法的转移矩阵为一次马尔科夫链，根据马氏链的性质，算法最终能够平稳分布，而稳定状态下各网页被访问的概率分布即为各节点的PageRank值，表达了各点在网络中的重要程度。 PageRank算法单纯基于网络的拓扑结构来判定节点的重要程度，传统PageRank算法将一个节点的所有链出节点视为同等重要，因此在计算转移矩阵时，如果节点i有n个链出节点，那么i的每个链出节点平均分得i的PageRank值的1/n。这显然是不合乎事实的，以Internet为例，假设有网页i指向n个子网页，尽管这n个网页同为i的后继网页，但这其中有些网页相对重要，客户访问它的几率就大，因此i的n个链出节点中有些权值要大于1/n，有些则很小。本文基于这一思想，对传统的PageRank算法进行了改进，理论上改进算法更符合客观事实。最后利用JAVA和MATLAB等技术将算法编程实现。 PageRank算法的易于理解性使得该算法广为流传，学者们已不满足只在搜索引擎中使用该算法，，凡是具有大量数据的网络环境都可以利用PageRank算法加以分析。生物信息学中的蛋白质-蛋白质交互网络（PPI）中包含了大量的待挖掘的生物信息，与万维网（www）一样，两者都为Scale-Free网络，这为PPI网络与PageRank算法的结合提供了前提。很多学者已经在这一领域做过尝试。本文在与前人数据相同的PPI网络数据上运行改进的PageRank算法程序，利用已知的几种在黑素瘤病人血液中测出的差异表达的蛋白质，找到了与这些蛋白相关的一组蛋白质，之后通过对比分析新旧两组结果，验证了改进算法的结果更符合生物意义。
[Abstract]:The PageRank algorithm, the world's most successful page ranking algorithm, was developed in 1998 by Larry Page and Sergey Brin, the founders of Google, and was originally used as a search engine for Google. In this algorithm, the browsing behavior of users is described as a Markov random surfing model. In order to satisfy the strong connectivity of the network, the value range of adding damping coefficient dbd in the algorithm is [0], so it can be regarded as the probability of how to select the next node. From the point of view of probability, the PageRank algorithm can be understood as: taking a node in the network as the initial point, the user continues to browse along a link out of the chain of the node with the probability of d. The operation process of the PageRank algorithm has to be iterated through several iterations with the probability of 1-d. Because the transfer matrix of the algorithm is a Markov chain, according to the properties of Markov chain, the algorithm can be distributed stably. In the stable state, the probability distribution of each web page being visited is the PageRank value of each node, which expresses the importance of each point in the network. The PageRank algorithm determines the importance of the node simply based on the topology of the network. In the traditional PageRank algorithm, all the nodes out of the chain of a node are considered as equally important, so when calculating the transfer matrix, if node I has n nodes out of the chain, then the average value of the PageRank value of I is 1 / n of the PageRank value of I for each out node of I. This is obviously not true. Let's say Internet, for example, has a web page I pointing to n subpages, and even though these n pages are successors to I, some of these pages are relatively important, and the chances of clients visiting them are high. As a result, some of the n chain out nodes of I have weights greater than 1 / n, while others are very small. Based on this idea, this paper improves the traditional PageRank algorithm, which is more in line with the objective facts in theory. Finally, using JAVA and MATLAB to program the algorithm. The easy-to-understand PageRank algorithm makes the algorithm popular, scholars are not satisfied to only use the algorithm in search engines. Any network environment with a large amount of data can be analyzed by using PageRank algorithm. The protein-protein interaction network (PPI) in bioinformatics contains a lot of biological information to be mined. Like the World wide Web (www), both of them are Scale-Free networks, which provide a prerequisite for the combination of PPI networks and PageRank algorithms. Many scholars have tried in this field. In this paper, we run the improved PageRank algorithm program on the PPI network data which is the same as the previous data, and find a set of proteins associated with these proteins using several known differentially expressed proteins in the blood of melanoma patients. By comparing the results of the two groups, it is proved that the improved algorithm is more suitable for biological significance.
【学位授予单位】：吉林大学
【学位级别】：硕士
【学位授予年份】：2012
【分类号】：TP301.6;TP393.0

【参考文献】