Multi-Record Web Page Extraction Based on Hierarchical Features of the DOM Tree

Published: 2018-07-12 16:11

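Topics: information extraction + multi-record web pages; Reference: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), 2015, Issue 02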

[Abstract]: Existing multi-record web page extraction methods usually analyze the vertical structure of the Document Object Model (DOM) tree as a whole. The structural similarity they compute is generally low, so they fail to identify record regions correctly. This paper proposes a record extraction method based on hierarchical features of the DOM tree. The method exploits the different roles played by nodes at different levels of the DOM tree to analyze the tree horizontally, which transforms the problem of finding similar subtrees into that of finding similar sub-blocks within node blocks. Finally, a bidirectional expansion search for non-overlapping repeated sub-blocks is used to segment the records. Experiments show that the method can extract pages that existing extractors cannot handle, and extraction results on multiple data sources confirm its effectiveness.
[Affiliation]: College of Mathematics and Computer Science, Fuzhou University
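[Funding]: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (No. 61300105), the Joint Project of the Doctoral Program Foundation of the Ministry of Education (No. 2012351410010), the Fujian Province Major Science and Technology Special Project (No. 2013H6012), and the Fuzhou Science and Technology Plan Project (No. 2013-PT-45)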
[Classification Code]: TP393.092; TP391.1
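
The final step named in the abstract, a bidirectional expansion search for non-overlapping repeated sub-blocks, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one level of the DOM tree has already been flattened into a list of sibling tag names, and it uses exact tag equality where the paper relies on a similarity measure. The function names `find_repeated_block` and `split_records` are hypothetical.

```python
from typing import List, Tuple


def find_repeated_block(tags: List[str], min_repeats: int = 2) -> Tuple[int, int, int]:
    """Return (start, block_len, repeats) for the longest run of adjacent,
    non-overlapping identical sub-blocks in a sibling tag sequence.
    Returns (0, 0, 0) when no block repeats at least min_repeats times."""
    n = len(tags)
    best = (0, 0, 0)
    best_cover = 0  # number of sibling nodes covered by the best run so far
    for block_len in range(1, n // min_repeats + 1):
        for seed in range(0, n - block_len + 1):
            block = tags[seed:seed + block_len]
            # Expand leftwards from the seed, one whole block at a time.
            left = seed
            while left - block_len >= 0 and tags[left - block_len:left] == block:
                left -= block_len
            # Expand rightwards from the seed, one whole block at a time.
            right = seed + block_len
            while right + block_len <= n and tags[right:right + block_len] == block:
                right += block_len
            cover = right - left
            repeats = cover // block_len
            # Strict '>' keeps the shortest block when two runs cover equally.
            if repeats >= min_repeats and cover > best_cover:
                best, best_cover = (left, block_len, repeats), cover
    return best


def split_records(tags: List[str]) -> List[List[str]]:
    """Cut the sibling sequence into records using the detected block."""
    start, block_len, repeats = find_repeated_block(tags)
    return [tags[start + i * block_len:start + (i + 1) * block_len]
            for i in range(repeats)]


# Assumed example: children of a listing container, with a heading and a rule
# surrounding three records that each consist of a <div> followed by a <p>.
siblings = ["h1", "div", "p", "div", "p", "div", "p", "hr"]
print(find_repeated_block(siblings))  # (1, 2, 3)
print(split_records(siblings))
```

Expanding in both directions from each seed means a record boundary is found even when the seed lands in the middle of the record region. A closer approximation of the method would replace the `==` comparison with a threshold on a sub-block similarity score.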
