Focused Crawler for Deep Web Data Sources

Published: 2018-11-21 18:28

[Abstract]: A large share of the pages on the Internet are generated dynamically from back-end databases. Because traditional search engines cannot reach these pages, they are known as the Deep Web. Data source discovery is a key step in large-scale Deep Web data source integration. This paper proposes a focused crawling algorithm for Deep Web data sources. When evaluating the importance of a link, the algorithm considers both the relevance of the page to the topic and information associated with the link. Experiments show that the method is effective.
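The abstract says that link importance combines page-topic relevance with link-related information, but it does not give the formula. The following Python sketch illustrates one way such a score could drive a crawl frontier. Everything in it is an assumption made for illustration and not the paper's method: the term-overlap `relevance` measure, the linear weight `alpha`, the use of anchor text and URL tokens as the "link-related information", and the names `link_score` and `Frontier`.

```python
import heapq
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(text, topic_terms):
    """Fraction of topic terms found in the text (an assumed, simple measure)."""
    if not topic_terms:
        return 0.0
    tokens = set(tokenize(text))
    return sum(1 for term in topic_terms if term in tokens) / len(topic_terms)


def link_score(page_text, anchor_text, url, topic_terms, alpha=0.5):
    """Blend page relevance with link-context relevance.

    alpha is a hypothetical weight; the source does not state one.
    """
    page_rel = relevance(page_text, topic_terms)
    link_rel = relevance(anchor_text + " " + url, topic_terms)
    return alpha * page_rel + (1 - alpha) * link_rel


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._count = 0  # tie-breaker that keeps insertion order stable

    def push(self, url, score):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    topic = ("book", "search")
    page = "Find a book online"
    links = [
        ("Contact us", "http://example.com/contact"),
        ("Advanced search", "http://example.com/book/search"),
    ]
    frontier = Frontier()
    for anchor, url in links:
        frontier.push(url, link_score(page, anchor, url, topic))
    while frontier:
        print(frontier.pop())
```

In this sketch a link found on a relevant page still ranks low if its own anchor text and URL carry no topic terms, which is the intuition behind combining the two signals. A real crawler would replace the overlap measure with whatever classifier or similarity function the crawl uses.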
[Affiliation]: Institute of Intelligent Information Processing and Applications, Soochow University
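[Funding]: National Natural Science Foundation of China (60673092); Key Project of the Ministry of Education Scientific Research Fund, 2005 (205059); Ministry of Education Research Fund for the Doctoral Program of Higher Education (20040285016); Jiangsu Province High-Tech Research Program (BG2005019)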
[Classification No.]: TP393.09
