Research on Hidden Web Retrieval Based on Tag-Tree Object Extraction

Published: 2018-02-04 19:56

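Keywords: Hidden Web; information retrieval; object extraction; structured query; tag tree　Source: Computer Engineering and Applications (《计算机工程与应用》), 2002, Issue 23　Paper type: journal article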

[Abstract]: Standard search engines today can retrieve only a small portion of the information on the World Wide Web, the so-called indexable Web. A large volume of Hidden Web information (estimated at 500 times the size of the indexable Web) is invisible to these search engines. This information is hidden behind search forms on Web pages and stored in large dynamic databases. This paper proposes a method for retrieving Hidden Web information, presents the framework of the system, and discusses the key implementation techniques in detail. The system uses a new Tag-Tree-based Object Extraction method to extract Hidden Web information from Web pages automatically, and on that basis presents a structured query algorithm for Hidden Web information. The paper closes with a discussion of the experimental results.
[Author affiliation]: Department of Computer Science, Shanghai Jiao Tong University
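[Funding]: Supported by a Major International Cooperation Project of the National Natural Science Foundation of China (No. 60221120145)
[Classification number]: TP391.4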
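[Text snapshot]: 1 Introduction. Today, people are accustomed to finding information online through search engines. At present, mainstream search engines collect essentially only the part of the Internet's information known as the publicly indexable Web. This information is the set of Web pages that a crawler downloads by following the hyperlink graph of Web pages under some control strategy; it is also commonly called the static page set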

[References]