A Near-Duplicate Web Page Detection Algorithm Based on Concepts and a Semantic Network

Posted: 2018-03-29 01:35

Topic: web page deduplication algorithms　Approach: small-world networks　Source: Journal of Software (《软件学报》), 2011, Issue 08

[Abstract]: Search engine result pages often contain web pages with near-identical content. To improve overall retrieval performance and user satisfaction, this paper proposes DWDCS (near-duplicate webpages detection based on concept and semantic network), a near-duplicate web page detection algorithm built on concepts and a semantic network. It improves on the classical algorithm that extracts document keywords using small-world theory. Document concepts are first extracted and merged, which both resolves the "expression difference" problem and substantially reduces the complexity of the semantic network. The network is then analyzed through the geometric features of its structure, and the syntactic and structural information of the web page is used to build feature vectors for computing document similarity. Because no corpus is required, the algorithm is inherently domain-independent. Experiments show that, compared with the classical web page deduplication algorithm I-Match and with an algorithm that relies solely on the word co-occurrence small-world model, DWDCS is highly robust to noise, reaching 90% precision and 85% recall in large-scale experiments. Its favorable time and space complexity, together with performance that does not depend on a corpus, have made it effective in practical large-scale web page deduplication.
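The pipeline the abstract outlines (merge words into concepts, link concepts that co-occur, derive a feature vector from the network, compare vectors) can be illustrated with a minimal sketch. This is not the DWDCS implementation: the synonym table `CONCEPT_MAP`, the sentence-level co-occurrence rule, the use of node degree as the only feature, and cosine similarity are all assumptions chosen for brevity, and every function name here is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical synonym table standing in for the concept-merging step.
CONCEPT_MAP = {"automobile": "car", "auto": "car", "purchase": "buy"}


def to_concepts(sentence):
    """Lowercase, split on whitespace, and map each word to its concept."""
    return [CONCEPT_MAP.get(word, word) for word in sentence.lower().split()]


def build_network(sentences):
    """Undirected co-occurrence graph: concepts in the same sentence are linked."""
    graph = defaultdict(set)
    for sentence in sentences:
        concepts = sorted(set(to_concepts(sentence)))
        for a, b in combinations(concepts, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def feature_vector(sentences):
    """Weight each concept by its degree (an assumed stand-in for richer
    structural features of the network)."""
    graph = build_network(sentences)
    return {node: len(neighbors) for node, neighbors in graph.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_a = ["I buy one car", "the car is red"]
doc_b = ["I purchase one automobile", "the auto is red"]
doc_c = ["rain falls tomorrow"]

sim_ab = cosine(feature_vector(doc_a), feature_vector(doc_b))
sim_ac = cosine(feature_vector(doc_a), feature_vector(doc_c))
```

In this toy setup `doc_a` and `doc_b` use different words but collapse to the same concepts, so their vectors coincide, while `doc_c` shares no concept with `doc_a`. A real system would also need the page-structure features and noise handling that the abstract mentions.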
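[Affiliations]: School of Computer Science and Technology, Beijing Institute of Technology; Beijing Aerospace Control Center
[Funding]: National Natural Science Foundation of China (60803050, 60705022); Program for New Century Excellent Talents in University (NCET-06-0161)
[Classification number]: TP393.092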
