Automatic Deep Web Data Extraction Based on Visual Information and the DOM Tree

Published: 2018-08-07 18:50

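[Abstract]: With the rapid growth of the Internet, the Web now holds a vast amount of information covering every field of the real world. Compared with the Surface Web, the Deep Web contains richer data, receives more traffic, and grows faster. However, Deep Web pages are generated dynamically and are difficult for traditional search engines to index, so obtaining and using Deep Web data effectively has become an important research direction. Deep Web data is presented through query result pages, but the data in these pages varies in form and lacks structure: it is convenient for users to browse yet hard to process automatically. Based on the visual information of web pages and the DOM tree structure, this thesis studies automatic data extraction from Deep Web query result pages. The main research contents are as follows. (1) Locating the data region. The characteristics of data regions in Deep Web query result pages are analyzed to find visual features that can locate them. Relevant pages are then collected as samples, and the nodes in the samples are labeled by hand. A decision tree is trained with Weka, and the rules corresponding to that decision tree are used to locate the data region. (2) Extracting data records. This process has two steps: locating data records and denoising. In the first step, a data record location algorithm is proposed based on the DOM tree structure and visual features of data records in a page; the nodes it returns include the data record nodes but also a small amount of noise. In the second step, a similarity measure between data records is defined through xpath, and noise is removed by comparing similarities, which yields the data record nodes. (3) Aligning data items. Each data record is first divided into data items; a data structure is then designed to make alignment easier, and an xpath-based algorithm for aligning data items is given. (4) Templates. Templates are proposed for data regions, data records, and data items according to their respective characteristics. Using templates avoids a large amount of repeated computation during extraction, which speeds up extraction, and also makes it convenient to extract data items from consecutive pages. The innovations of this thesis are as follows. (1) The concept of xpath is introduced: the similarity of data records is defined through xpath and used to denoise data records, and data items are aligned by comparing xpaths. (2) The concept of data item granularity is proposed, together with a method for dividing data records into data items. On the basis of this research, an automatic data extraction system for Deep Web query result pages is designed and developed, and other problems encountered during extraction are solved, such as extracting AJAX asynchronous data. Experiments show that the method can extract data from Deep Web query result pages quickly and accurately.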
[Abstract]: With the rapid development of the Internet, the Web has come to contain a large amount of information resources covering all fields of the real world. Compared with the Surface Web, the Deep Web contains more data, receives more traffic, and grows faster. However, Deep Web pages are dynamically generated and are difficult for traditional search engines to index. Therefore, how to effectively obtain and utilize the data of Deep Web pages has become an important research direction. Deep Web data is presented through query result pages, but the data in these pages differs in form and lacks structure, so it is easy for users to browse but difficult to use. Based on the visual information of web pages and the structure of the DOM tree, this thesis studies automatic data extraction from Deep Web query result pages. The main research contents are as follows: (1) Locating data regions. First, by analyzing the characteristics of the data region in Deep Web query result pages, we identify the visual features that allow it to be located. Then relevant pages are collected as samples and the nodes in the samples are annotated manually. The corresponding decision tree is obtained by training with Weka. Finally, the rules corresponding to the decision tree are used to locate the data region. (2) Extracting data records. This process is divided into two steps: locating data records and denoising. In the first step, according to the DOM tree structure and visual characteristics of data records in the web page, a data record location algorithm is proposed, but the nodes obtained from this algorithm contain not only data record nodes but also a small amount of noise. In the second step, the similarity of data records is defined by xpath, and noise is removed by comparing similarities, which yields the data record nodes. (3) Aligning data items. First, each data record is divided into its data items; then a data structure is designed to facilitate alignment, and an xpath-based algorithm for aligning data items is given. (4) Templates. According to the characteristics of data regions, data records, and data items, corresponding templates are proposed. Using templates avoids a large number of repeated calculations during extraction, which improves extraction speed, and makes it convenient to extract data items from consecutive pages. The innovations of this thesis are as follows: (1) The concept of xpath is introduced; the similarity of data records is defined by xpath and used to denoise data records, and data items are aligned by comparing xpaths. (2) The concept of data item granularity is proposed, and a corresponding method of dividing data records into data items is given. Based on the above research, an automatic data extraction system for Deep Web query result pages is designed and developed, and other problems encountered in the extraction process are solved, such as the extraction of AJAX asynchronous data. Experiments show that this method can extract data from Deep Web query result pages quickly and accurately.
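The abstract does not give the exact definition of the xpath-based record similarity or of the alignment algorithm, so the following is only a minimal Python sketch of the general idea, under stated assumptions. It assumes that each candidate record is summarized by the set of relative tag paths leading to its text-bearing elements, that similarity is the Jaccard overlap of those sets, and that a candidate counts as noise when it is similar to fewer than half of the other candidates. The names `leaf_paths`, `similarity`, `denoise`, `align`, the threshold of 0.5, and the sample page are all hypothetical and are not taken from the thesis.

```python
import xml.etree.ElementTree as ET


def leaf_paths(record):
    """Return (relative tag path, text) pairs for text-bearing elements under `record`."""
    pairs = []

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            pairs.append((path, text))
        for child in node:
            walk(child, f"{path}/{child.tag}")

    walk(record, ".")
    return pairs


def similarity(a, b):
    """Jaccard similarity of two records' sets of relative tag paths (an assumed measure)."""
    pa = {p for p, _ in leaf_paths(a)}
    pb = {p for p, _ in leaf_paths(b)}
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)


def denoise(candidates, threshold=0.5):
    """Keep candidates that are similar to at least half of the other candidates."""
    kept = []
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        similar = sum(similarity(cand, o) >= threshold for o in others)
        if others and similar * 2 >= len(others):
            kept.append(cand)
    return kept


def align(records):
    """Align data items into columns keyed by relative tag path; gaps become None."""
    columns, rows = [], []
    for rec in records:
        row = {}
        for path, text in leaf_paths(rec):
            row.setdefault(path, text)  # keep the first item found on a repeated path
            if path not in columns:
                columns.append(path)
        rows.append(row)
    return columns, [[row.get(c) for c in columns] for row in rows]


# Hypothetical data region: three records and one noise node.
PAGE = """
<div>
  <div><a>Book A</a><span>$10</span></div>
  <div><a>Book B</a><span>$12</span><em>In stock</em></div>
  <p>Sponsored links</p>
  <div><a>Book C</a><span>$9</span></div>
</div>
"""

candidates = list(ET.fromstring(PAGE.strip()))
records = denoise(candidates)
columns, table = align(records)
```

In this sketch the `p` node shares no tag path with the other candidates and is dropped, while the record with an extra `em` item is kept and simply fills a column that is empty for the other records. A real system would also need the visual features and the data item granularity that the abstract mentions, which are not modeled here.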
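[Degree-granting institution]: Ocean University of China (中国海洋大学)
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092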

[References]