A Deep Web Information Extraction Method Based on Visual Features

Published: 2018-10-09 07:16

[Abstract]: As Web databases continue to grow, a large amount of online information cannot be reached through ordinary search engines; users must submit a form query and receive result pages generated from a back-end database to get the information they want. This part of the Web is called the Deep Web. How to extract the entity information on these pages effectively is therefore a problem worth studying. By analyzing the characteristics of Deep Web result pages and drawing on human visual perception, this paper proposes a Deep Web information extraction method based on visual features. Before the parser turns the Web document into a syntax tree, the method removes content unrelated to the page topic, such as navigation bars and advertisements. It then applies the VIPS algorithm to the optimized DOM tree to divide it into semantic blocks. After segmentation, it first locates a reference visual block according to position features, then, taking that block as the center, traverses the DOM tree in both reverse and forward order to find all similar visual blocks and extracts them. The experimental results indicate that, compared with traditional methods, the method achieves some improvement in extraction speed and in the accuracy and completeness of the extracted information.
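
The last step of the abstract (locate a reference visual block, then scan backward and forward for similar blocks) can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the page has already been cleaned and segmented (for example by VIPS) into a list of blocks in DOM order, and the `Block` fields, the "closest to page center" rule for picking the reference block, and the similarity test in `is_similar` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A visual block as a segmenter might report it (hypothetical fields)."""
    tag_path: str  # DOM path signature, e.g. "div/table/tr"
    x: int         # left edge in pixels
    y: int         # top edge in pixels
    width: int
    height: int


def is_similar(a: Block, b: Block, tol: float = 0.2) -> bool:
    """Assumed similarity test: same tag path, aligned left edge, similar width."""
    if a.tag_path != b.tag_path:
        return False
    limit = tol * max(a.width, b.width)
    return abs(a.x - b.x) <= limit and abs(a.width - b.width) <= limit


def find_reference(blocks: list[Block], page_w: int, page_h: int) -> int:
    """Assumed position rule: pick the block whose center is nearest the page center."""
    cx, cy = page_w / 2, page_h / 2

    def dist(i: int) -> float:
        b = blocks[i]
        return (b.x + b.width / 2 - cx) ** 2 + (b.y + b.height / 2 - cy) ** 2

    return min(range(len(blocks)), key=dist)


def extract_records(blocks: list[Block], page_w: int, page_h: int) -> list[Block]:
    """Return the reference block plus all blocks similar to it, in DOM order."""
    if not blocks:
        return []
    ref = find_reference(blocks, page_w, page_h)
    base = blocks[ref]
    found = [ref]
    # Reverse-order traversal from the reference block toward the start.
    for i in range(ref - 1, -1, -1):
        if is_similar(blocks[i], base):
            found.append(i)
    # Forward-order traversal from the reference block toward the end.
    for i in range(ref + 1, len(blocks)):
        if is_similar(blocks[i], base):
            found.append(i)
    return [blocks[i] for i in sorted(found)]
```

Both scans cover the whole list rather than stopping at the first dissimilar block, so an advertisement wedged between two data records does not cut the result short. Whether the original method behaves this way is not stated in the abstract.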
[Author Affiliation]: Shanghai Normal University
[Classification Number]: TP391.1

[Similar Literature]