Research on Crawling Strategies for Deep Web Crawlers

Published: 2018-06-29 22:43

Topics: Deep Web + Deep Web crawler; Reference: Computer Engineering and Design (《计算机工程与设计》), 2006, No. 17

[Abstract]: A growing share of the information on the Web can be reached only through query interfaces: to obtain pages from a Deep Web site, a user has to submit a series of keyword queries. Because no static links point directly to Deep Web pages, most current search engines cannot discover or index them. Recent studies nevertheless show that the high-quality information provided by Deep Web sites is of great value to many users. This paper studies how to build an effective Deep Web crawler that can automatically discover and download Deep Web pages. Since the query interface is the only "entry point" to the Deep Web, the main challenge in designing such a crawler is how to automatically generate meaningful queries for that interface. The paper proposes a theoretical framework for this automatic query-generation problem. Experiments on real Deep Web sites show that the method is very useful.
[Affiliation]: Institute of Intelligent Information Processing and Application, Soochow University
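[Funding]: Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040285016); Jiangsu Province High-Tech Research Fund (BG2005019).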
[Classification number]: TP393.092
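
The query-generation problem described in the abstract can be illustrated with a small sketch. This is not the paper's framework, which the abstract does not spell out. It assumes a hypothetical in-memory site (`PAGES`), a simulated query interface (`search`), and a simple heuristic chosen only for illustration: the next query is the most frequent term, among pages downloaded so far, that has not yet been issued.

```python
from collections import Counter

# Hypothetical stand-in for a Deep Web site: pages reachable only via search.
PAGES = {
    1: "deep web crawler query interface",
    2: "query generation for hidden databases",
    3: "search engines index static links",
    4: "hidden databases hold valuable information",
}


def search(keyword):
    """Simulated query interface: ids of pages that contain the keyword."""
    return {pid for pid, text in PAGES.items() if keyword in text.split()}


def crawl(seed, max_queries=10):
    """Issue queries until the budget is spent or no unissued terms remain."""
    downloaded = {}          # page id -> page text
    issued = set()           # keywords already sent to the interface
    term_counts = Counter()  # term frequencies over downloaded pages
    keyword = seed
    while keyword is not None and len(issued) < max_queries:
        issued.add(keyword)
        for pid in search(keyword):
            if pid not in downloaded:
                downloaded[pid] = PAGES[pid]
                term_counts.update(PAGES[pid].split())
        # Assumed heuristic: pick the most frequent term not yet issued
        # (ties are broken by taking the alphabetically last term).
        candidates = [
            (count, term)
            for term, count in term_counts.items()
            if term not in issued
        ]
        keyword = max(candidates)[1] if candidates else None
    return downloaded, issued


downloaded, issued = crawl("query")
print(sorted(downloaded))
```

In this toy setup, page 3 shares no term with the other pages, so a crawl seeded with "query" never reaches it. That is the kind of coverage gap that makes the choice of queries the central design question.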

[Similar literature]