异质网络上的自相似性连接算法研究与实现
发布时间:2018-05-30 22:19
本文选题:相似性连接 + 异质网络 ; 参考:《复旦大学》2013年硕士论文
【摘要】:相似性连接是很多研究问题的基础,不少实际问题也都可以归结为相似性连接。尽管近年来,越来越多的学者将注意力集中到网络数据和图数据上,如链接预测、社区检测和在线广告等,目前对于异质网络中的相似性连接却鲜有研究。特别地,已有工作都忽略了异质网络中节点和关系的类型以及由此带来的不同路径背后隐藏的不同语义信息,所以不能直接用于解决异质网络的相似性连接问题。 针对异质网络的异质性特点,本文提出了一套基于路径的异质网络自相似性连接解决方案。给定一条连接路径,该方案返回目标类型中前k个最相似的节点对象对。在这一框架下,我们分别研究了精确自相似性连接和近似自相似性连接两个具体的问题。在精确自相似性连接中,我们对空间数据库最近点对查找算法PSR进行改进,提出了一种基于R*树的算法ePS-join。虽然改进有效,但由于索引结构R*树本身的局限,该算法并不能很好地处理高维时的情况。于是,为了适应异质网络的另两个特点,即高维性和海量性,我们将精确问题转化成近似问题,以在准确率和效率之间做下折中。在近似自相似性连接中,我们引入LSH为数据集建立索引,提出了基于NBLSH的aPS-join,并利用相似性度量的性质设计了一种剪枝优化策略来进一步减小候选集的大小,改进后的算法称为基于BPLSH的aPS-join。 相比同质网络中的相似性连接算法LS-join,aPS-join能够结合异质网络中的语义信息。真实数据集上的实验表明,我们的算法不仅具有良好的效率,而且能保证很高的准确性。
[Abstract]:Similarity join is the foundation of many research problems, and many practical problems can be attributed to similarity join. Although in recent years, more and more scholars focus on network data and graph data, such as link prediction, community detection and online advertising, there is little research on similarity connection in heterogeneous networks. In particular, the existing work has neglected the types of nodes and relationships in heterogeneous networks and the different semantic information hidden behind different paths, so they can not be directly used to solve the problem of similarity connection in heterogeneous networks. According to the heterogeneity characteristics of heterogeneous networks, this paper proposes a path based solution for self-similarity connection of heterogeneous networks. Given a connection path, the scheme returns the first k most similar node object pairs in the target type. In this framework, we study the exact self-similarity connection and the approximate self-similarity connection respectively. In the accurate self-similarity connection, we improve the nearest point search algorithm (PSR) of spatial database, and propose an algorithm based on R* tree (ePS-jointing). Although the improvement is effective, due to the limitation of the index structure R* tree itself, the algorithm can not deal with the high dimensional time very well. Therefore, in order to adapt to the other two characteristics of heterogeneous networks, that is, high-dimensional and magnanimity, we transform the exact problem into an approximate one, so as to make a compromise between accuracy and efficiency. In approximate self-similarity join, we introduce LSH to index data sets, propose aPS-join based on NBLSH, and design a pruning optimization strategy to further reduce the size of candidate set by using the property of similarity measure. The improved algorithm is called aPS-join based on BPLSH. Compared with the similarity join algorithm in homogeneous networks, LS-joinaPS-join can combine semantic information in heterogeneous networks. Experiments on real data sets show that our algorithm not only has good efficiency, but also ensures high accuracy.
【学位授予单位】:复旦大学
【学位级别】:硕士
【学位授予年份】:2013
【分类号】:TP311.13
【相似文献】
相关期刊论文 前10条
1 郝建柏;陈贤富;黄双福;杨俊;;一种基于模糊近邻标签传递的半监督分类算法[J];微电子学与计算机;2010年02期
2 余海洋;林琛;陈珂;江弋;邹权;;Pass-Join-K:多分段匹配的相似性连接算法[J];计算机科学与探索;2013年10期
3 刘雪莉;王宏志;李建中;高宏;;实体数据库中多相似连接顺序选择策略[J];计算机科学与探索;2012年10期
4 ;[J];;年期
5 ;[J];;年期
6 ;[J];;年期
7 ;[J];;年期
8 ;[J];;年期
9 ;[J];;年期
10 ;[J];;年期
相关硕士学位论文 前3条
1 刘雪莉;基于实体的相似性连接操作的研究[D];哈尔滨工业大学;2012年
2 周健雯;异质网络上的自相似性连接算法研究与实现[D];复旦大学;2013年
3 徐媛媛;基于MapReduce的相似性连接研究[D];宁波大学;2014年
,本文编号:1957056
本文链接:https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1957056.html