Research on an Improved Tree Path Model in Web Page Clustering

Published: 2018-04-09 22:26

Topic: information extraction　Focus: web page structure　Source: Computer Science (《计算机科学》), 2015, Issue 05

[Abstract]: Similarity calculation is the foundation of text mining and a key step in information extraction. For web pages with complex structure, current similarity measures based on the traditional tree path model are not yet accurate enough. The traditional model ignores the order in which paths appear, and it compares paths by exact matching, so it cannot describe precisely how similar two paths are when they match only partially. Starting from the structural similarity of web pages, this paper therefore proposes an improved tree path model. The model takes into account the relationships between sibling nodes, path position, and path weight, remedying the traditional model's inability to express document structure and hierarchy. Experimental results show that the model improves the recognition of structural similarity between web pages: it clearly separates pages whose structures differ substantially, it better reflects the differences among pages generated from the same template, and it yields better results in web page clustering.
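The abstract does not give the paper's formulas, so the following is only a minimal Python sketch of the contrast it describes, not the authors' model. It assumes a toy tree representation `(tag, [children])`, a Jaccard-style baseline for the traditional exact-match path comparison, and a hypothetical order-aware variant that scores partially matching paths by their shared prefix. Sibling relationships and path weights, which the paper also considers, are left out. All function names are illustrative.

```python
# Toy DOM node: (tag, [children]). This representation is an assumption
# made for illustration; it is not taken from the paper.

def root_to_leaf_paths(node, prefix=()):
    """Return all root-to-leaf tag paths in document order."""
    tag, children = node
    path = prefix + (tag,)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths

def exact_path_similarity(tree_a, tree_b):
    """Baseline: Jaccard overlap of path sets (exact match, order ignored)."""
    a = set(root_to_leaf_paths(tree_a))
    b = set(root_to_leaf_paths(tree_b))
    return len(a & b) / len(a | b)

def partial_path_similarity(p, q):
    """Share of the longer path covered by the common prefix of p and q."""
    shared = 0
    for x, y in zip(p, q):
        if x != y:
            break
        shared += 1
    return shared / max(len(p), len(q))

def ordered_partial_similarity(tree_a, tree_b):
    """Hypothetical variant: compare paths position by position in
    document order, giving partial credit for shared prefixes."""
    pa = root_to_leaf_paths(tree_a)
    pb = root_to_leaf_paths(tree_b)
    total = sum(partial_path_similarity(p, q) for p, q in zip(pa, pb))
    return total / max(len(pa), len(pb))

# Two pages with the same set of paths; only the order of <h1> and <p> differs.
page_a = ("html", [("body", [
    ("div", [("h1", []), ("p", [])]),
    ("ul", [("li", []), ("li", [])]),
])])
page_b = ("html", [("body", [
    ("div", [("p", []), ("h1", [])]),
    ("ul", [("li", []), ("li", [])]),
])])

print(exact_path_similarity(page_a, page_b))       # 1.0
print(ordered_partial_similarity(page_a, page_b))  # 0.875
```

The baseline treats the two pages as identical because it sees only the set of paths. The order-aware sketch scores them slightly below 1, which is the kind of within-template difference the abstract says the improved model is meant to capture.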
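[Author affiliations]: College of Computer and Information, Hohai University; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics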
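[Funding]: Supported by the Jiangsu Water Conservancy Science and Technology Project "'Smart River' Research and Its Application in the Management of the Chuhe River in Luhe" (2013025) and the Fundamental Research Funds for the Central Universities, Hohai University (2009B21614)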
[Classification numbers]: TP391.1; TP393.092
