Research on Deep Web Information Extraction Based on Visual Segmentation and Semantic DOM

Published: 2018-06-23 07:27

Topic: data extraction + DOM tree; Source: Shanghai Normal University, master's thesis, 2016

【Abstract】: The Deep Web is the information hidden behind ordinary search engines, which can be obtained only when a user submits a form query and a result page is returned from a back-end database. Deep Web data extraction is currently an active research topic. As page structures grow more complex and dynamic web technologies are introduced, Deep Web pages have become heterogeneous and semi-structured. Extracting the data users care about from these semi-structured result pages quickly and effectively, in order to provide specific services, has become a difficult problem. The main research questions are: (1) how to identify noise quickly and effectively, so that the page is cleaned as far as possible before the original page is analyzed; (2) how to quickly locate the main data region of a page from the DOM tree structure and the page's visual information; (3) how to extract page data as automatically as possible without being affected by differences in page structure. Traditional page analysis methods based on the DOM tree alone can no longer meet these needs. Such methods rely mainly on the structural features of the DOM tree: they must parse every tag of the page to build the tree, they ignore useful visual features of the page, and whenever the page structure changes, the structure has to be analyzed again before extraction. Microsoft Research Asia has proposed a new page data extraction method, the VIPS algorithm. VIPS departs from traditional DOM-tree-based extraction: starting from human visual perception, it segments the page into visual blocks and regroups those blocks semantically to form a visual block tree. The algorithm thereby builds a bridge between the DOM tree structure and the semantics of the page.
By analyzing the characteristics of Deep Web result pages and drawing on human visual perception, this thesis proposes, on the basis of the VIPS algorithm, a Deep Web information extraction method based on a reference visual block. The method first analyzes the page's tags and, before the parser turns the Web document into a syntax tree, removes information unrelated to the page topic (for example navigation bars and advertisements). It then applies VIPS to the optimized DOM tree to divide it into semantic blocks. After segmentation, it locates the reference visual block by coordinate position and, taking that block as the center, traverses the DOM tree in reverse and forward order, using a linear feature vector discriminant to find all similar visual blocks and extract them. The experimental results indicate that the proposed reference-visual-block method is feasible and that it improves extraction accuracy to some extent compared with traditional methods.
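The reference-block step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual features, weights, or selection rule, so everything concrete here is an assumption: the `VisualBlock` fields, choosing the block nearest the page center as the reference, the relative-difference similarity score, and the 0.8 threshold are all hypothetical stand-ins for the linear feature vector discriminant.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VisualBlock:
    """Hypothetical leaf block from a VIPS-style segmentation, in document order."""
    name: str
    x: float
    y: float
    width: float
    height: float
    n_links: int
    n_text_chars: int

    def features(self) -> List[float]:
        # Assumed feature vector; y is left out because records stack vertically.
        return [self.x, self.width, self.height,
                float(self.n_links), float(self.n_text_chars)]


def pick_reference(blocks: Sequence[VisualBlock], page_w: float, page_h: float) -> int:
    """Assumed rule: the block whose center is closest to the page center."""
    def dist(b: VisualBlock) -> float:
        cx, cy = b.x + b.width / 2, b.y + b.height / 2
        return ((cx - page_w / 2) ** 2 + (cy - page_h / 2) ** 2) ** 0.5
    return min(range(len(blocks)), key=lambda i: dist(blocks[i]))


def similarity(a: VisualBlock, b: VisualBlock, weights: Sequence[float]) -> float:
    """Weighted linear score in [0, 1]; 1.0 means identical feature vectors."""
    total = 0.0
    for w, fa, fb in zip(weights, a.features(), b.features()):
        denom = max(abs(fa), abs(fb))
        rel_diff = 0.0 if denom == 0 else abs(fa - fb) / denom
        total += w * (1.0 - rel_diff)
    return total / sum(weights)


def extract_similar(blocks: Sequence[VisualBlock], page_w: float, page_h: float,
                    weights: Sequence[float], threshold: float = 0.8) -> List[VisualBlock]:
    """Scan backward, then forward, from the reference block and keep similar blocks."""
    ref = pick_reference(blocks, page_w, page_h)
    picked = [ref]
    for i in range(ref - 1, -1, -1):          # reverse-order traversal
        if similarity(blocks[ref], blocks[i], weights) >= threshold:
            picked.append(i)
    for i in range(ref + 1, len(blocks)):     # forward-order traversal
        if similarity(blocks[ref], blocks[i], weights) >= threshold:
            picked.append(i)
    return [blocks[i] for i in sorted(picked)]
```

Calling `extract_similar(blocks, page_w, page_h, weights=[1, 1, 1, 1, 1])` on a block list returns the reference block together with every block whose score against it reaches the threshold, in document order. Scanning the whole list, rather than stopping at the first dissimilar block, is a design choice that matches the abstract's wording "all similar visual blocks".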
【Degree-granting institution】: Shanghai Normal University
【Degree level】: Master's
【Year conferred】: 2016
【Classification number】: TP393.092
