Research on WEB Information Extraction Based on XML User-Defined Requirements

Published: 2019-07-01 18:38

[Abstract]: With the rapid development of the Internet in recent years, it has become a vast platform for publishing and sharing information resources. How to obtain the information users need quickly and efficiently from massive unstructured or semi-structured data, however, remains a pressing open problem, and WEB information extraction technology emerged in response. Although a great deal of research has been done, existing techniques still have several shortcomings: the extraction methods are too specialized, which increases the burden of semantic understanding on users and makes the methods inconvenient to use; user feedback is hard to obtain in time during extraction, which degrades the results; and the more complex the content to be extracted, the less robust the extraction rules. To address these problems, and building on an in-depth study of XML, its related standards, and existing XML-based extraction methods, this thesis proposes a WEB information extraction method based on XML user-defined requirements. The research covers the following aspects:

(1) Processing the page to be extracted. The HTML page is preprocessed to filter out irrelevant information and code and is converted into a well-formed XML document. So that the user can clearly grasp the page structure, the XML document is parsed into a visual DOM tree. During node conversion, the type of each node is marked and its path expression is computed, in preparation for sample mapping and extraction rule generation.

(2) Acquiring the user's extraction requirements. A user-defined target describes the hierarchical relationships among the data nodes to be extracted and also serves as the output structure for the extracted information. Samples marked by the user serve as the basis for generating extraction rules: following the mapping rules, each sample is mapped onto the target structure by structure mapping or content mapping, which yields the node type and location information of the data to be extracted.

(3) Constructing extraction rules. An extraction rule consists of one or more templates, each matching the nodes at one level of the target structure. Templates are constructed differently depending on whether the root node of the target structure has a structure mapping. If it does, the class attribute of the sample's structure mapping is used to match all nodes of the same class across the document, relative paths are used to cover parent-child and ancestor-descendant relationships, and the template for each level is generated recursively. If it does not, the common path of the root's child nodes is taken as the starting point for template matching; because this starting position is unique, only the sample data is extracted.

Finally, comparative experiments verify the effectiveness of the proposed method and show that it extracts better than two existing methods, and that the extraction rules remain robust when the content structure is complex. A prototype system implementing the method was also built; a demonstration shows that users can observe the whole extraction process intuitively, check promptly whether the extraction results are accurate, and modify them conveniently.
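
The abstract gives no implementation details, so the following is only a minimal Python sketch of the idea behind steps (1) and (3), not the thesis's actual system. It assumes the page has already been cleaned into well-formed XML, and the sample page, the rule dictionary layout, and the helper names `index_paths`, `build_rule`, and `apply_rule` are all hypothetical. It computes a path expression for every node, derives a rule from one user-marked sample (the record's class attribute plus relative paths to its fields), and applies that rule to every node of the same class.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already preprocessed into well-formed XML.
PAGE = """
<html><body>
  <div class="nav"><a>Home</a></div>
  <ul>
    <li class="book"><span class="title">XML Basics</span><b>12.50</b></li>
    <li class="book"><span class="title">Web Mining</span><b>30.00</b></li>
  </ul>
</body></html>
"""


def index_paths(root):
    """Map every element to an absolute path expression with positional indexes."""
    paths = {root: "/" + root.tag}

    def walk(node):
        counts = {}  # how many children with each tag have been seen so far
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[node]}/{child.tag}[{counts[child.tag]}]"
            walk(child)

    walk(root)
    return paths


def build_rule(paths, record, fields):
    """Derive a rule from one marked sample.

    record: the element mapped to the target root (a structure mapping).
    fields: {field name: descendant element that holds the field's text}.
    """
    prefix = paths[record]
    return {
        "tag": record.tag,
        "class": record.get("class"),
        # Path of each field relative to the record, e.g. "./span[1]".
        "fields": {name: "." + paths[node][len(prefix):]
                   for name, node in fields.items()},
    }


def apply_rule(root, rule):
    """Match all nodes sharing the sample's class, then follow the relative paths."""
    query = f".//{rule['tag']}[@class='{rule['class']}']"
    rows = []
    for rec in root.findall(query):
        row = {}
        for name, rel in rule["fields"].items():
            hit = rec.find(rel)
            row[name] = hit.text if hit is not None else None
        rows.append(row)
    return rows


root = ET.fromstring(PAGE)
paths = index_paths(root)
sample = root.find(".//li")  # stands in for the record the user marked
rule = build_rule(paths, sample,
                  {"title": sample.find("span"), "price": sample.find("b")})
records = apply_rule(root, rule)
print(records)
```

Here a single marked sample is enough to extract both list items, because the rule matches by class attribute instead of by the sample's absolute position. A rule anchored at a unique absolute path would return only the sample itself, which corresponds to the second case described in step (3).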
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP391.1; TP393.092

[References]