Research on Methods for Extracting Entity Table Information from Web Pages

Published: 2018-03-18 07:29

Topic: ontology generation　Focus: information extraction　Source: Beijing University of Technology, 2016 master's thesis　Type: degree thesis

[Abstract]: With the rapid growth of the Internet, the volume of information on web pages is increasing exponentially, and browsing page by page can no longer meet users' needs. Information extraction technology emerged in response. It spares users from manually filtering content and instead helps them obtain valuable information directly from massive web data. Web information extraction has developed mainly along two directions: wrappers and structure recognition. Wrappers depend heavily on page structure and therefore have poor reusability and generality. The method in this thesis belongs to structure recognition: it locates and recognizes semi-structured information in web pages well, is general across most web pages, and produces results that can be applied directly to ontology generation, which gives it high practical value.

The crawler implemented in the extraction system is an incremental, depth-first, focused crawler. It generates crawl tasks from configuration files, one file per task. A configuration file has a specific format and set of fields and is edited by hand; about a dozen fields are enough to configure a focused crawl of a particular website, domain, or topic.

After the pages are cleaned, the thesis proposes two entity location algorithms for tables marked up with the TABLE tag: one based on heuristic rules and one based on classifying web page URLs. Six rules are summarized from tag features, table structure features, and table content features. The rules are applied in turn to generate a string, a finite automaton recognizes the string, and the state in which the automaton stops determines whether the table is a genuine table. To improve location accuracy, the URL classification method categorizes URLs so that pages containing no tables can be discarded. Combining the two methods gives relatively high table location accuracy. For tables without TABLE tags, the thesis formulates heuristic rules for those laid out with special symbols, and proposes a location method combining the DOM tree with heuristic rules for those organized with other tags.

For table structure recognition, the thesis builds a type tree from the differing types of attribute names and attribute values, and infers the table's expansion direction from the type differences between cells. It also digitizes the table and infers the expansion direction from the length differences between cells. The two judgments are assigned different weights, and the weighted result determines whether the table expands horizontally or vertically. Finally, the number of rows or columns spanned by the table header is determined from the type differences and structural differences.
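
The weighted orientation decision described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the flat `cell_type` categories stand in for the type tree, the table is assumed to be rectangular, and the function names, the weights 0.7 and 0.3, and the labels `top_row` and `left_column` are all introduced here for illustration.

```python
import re
from statistics import mean


def cell_type(text):
    """Coarse cell type; a flat stand-in for the type tree (assumption)."""
    text = text.strip()
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?\d+(\.\d+)?%?", text):
        return "number"
    if re.fullmatch(r"\d{4}-\d{1,2}-\d{1,2}", text):
        return "date"
    return "text"


def type_diff(a, b):
    """1.0 if two cells have different coarse types, else 0.0."""
    return 0.0 if cell_type(a) == cell_type(b) else 1.0


def length_diff(a, b):
    """Relative difference in text length, in the range [0, 1]."""
    longest = max(len(a), len(b), 1)
    return abs(len(a) - len(b)) / longest


def mean_adjacent_diff(lines, diff):
    """Average diff over neighbouring cells within each line (row or column)."""
    values = [diff(x, y) for line in lines for x, y in zip(line, line[1:])]
    return mean(values) if values else 0.0


def header_position(table, w_type=0.7, w_len=0.3):
    """Guess whether attribute names sit in the top row or the left column.

    If cells vary more along rows than along columns, each column is
    homogeneous, so records are stacked downward under a top header row.
    The weights are hypothetical, not values taken from the thesis.
    """
    rows = table
    cols = [list(col) for col in zip(*table)]  # assumes a rectangular table
    score = 0.0
    for weight, diff in ((w_type, type_diff), (w_len, length_diff)):
        within_rows = mean_adjacent_diff(rows, diff)
        within_cols = mean_adjacent_diff(cols, diff)
        score += weight * (within_rows - within_cols)
    return "top_row" if score >= 0 else "left_column"


if __name__ == "__main__":
    sample = [
        ["Name", "Price", "Released"],
        ["Alpha", "12.5", "2015-03-01"],
        ["Beta", "8", "2016-11-20"],
    ]
    print(header_position(sample))
    print(header_position([list(col) for col in zip(*sample)]))
```

Transposing the sample table swaps the roles of rows and columns, so the two calls should disagree; that symmetry is a quick sanity check on any choice of weights.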
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1
