Discussion and Application of an Automatic Data Extraction Method for Template-Generated Web Pages

Published: 2018-03-28 14:02

Topic: Web information extraction technology   Focus: web page templates   Source: East China Normal University, master's thesis, 2009

[Abstract]: With the rapid growth of the Internet, the Web has become a huge information repository, and a variety of Web information extraction techniques have emerged to make effective use of it. Many pages on the Web today are generated dynamically: the site selects data from a back-end database in response to a user request, embeds the data in a common template, and adapts the result to the site's specific needs. Product description pages on e-commerce sites are a typical example. The classic methods for automatically extracting the valid data from such template-generated pages include RoadRunner and EXALG. The time complexity of the RoadRunner algorithm grows exponentially, which limits its practical use. EXALG improves substantially on RoadRunner, but it still does not take into account important features such as the visual layout information of a page, tag attributes, and string similarity. To address these problems, this thesis studies a formal description of the template detection problem for such pages, explores a new template detection method based on their structural characteristics, and uses the detected template to extract data automatically from the corresponding instance pages. Finally, the extraction algorithm built on this template detection is applied to pages of an e-commerce website, where it automatically extracts important data such as product list information and product detail information. Compared with other methods, this method handles both "list pages" and "detail pages", and it considerably improves the recall and precision of data extraction on such pages.
The main contents and structure of the thesis are as follows:
First, the thesis reviews the state of the art and related technologies in data extraction for template-generated web pages, and states its research objectives and scope of work.
Second, it introduces the mainstream web page data extraction techniques used in Web data extraction and systematically analyzes the strengths and weaknesses of the widely used classic techniques. On that basis, it develops an effective data extraction method for template-generated web pages, together with the algorithm that implements it, which automatically extracts the valid data from such pages.
Next, the thesis describes in detail the design and implementation of its automatic data extraction algorithm for template-generated web pages. The algorithm first parses the cleaned HTML page into two data structures, a tag tree and a tag queue. Because most pages contain navigation bars, advertisements, version information, and other data unrelated to the content to be extracted, it then filters out this irrelevant or redundant information with the tag tree matching algorithm proposed in the thesis. Finally, a core sub-algorithm that computes Ctokens classifies the tags of these HTML pages, so that the template structure of the pages and the field-level data used to generate them can be extracted automatically from the resulting Ctokens.
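The first step of the algorithm, parsing a cleaned page into a tag tree and a tag queue, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `Node` and `TreeAndQueueBuilder` classes, the `parse` helper, and the `#text` placeholder token are hypothetical names chosen for illustration, and the abstract does not specify the actual data layout. The sketch relies only on the standard-library `html.parser` module.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (hypothetical structure for illustration)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child elements in document order
        self.text = []      # non-empty text chunks directly inside this element


class TreeAndQueueBuilder(HTMLParser):
    """Builds a tag tree and a flat, document-order tag queue in one pass."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]  # currently open elements
        self.queue = []           # flat token sequence

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1])
        self.stack[-1].children.append(node)
        self.queue.append("<%s>" % tag)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close the nearest matching open element; stray closers are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                self.queue.append("</%s>" % tag)
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].text.append(text)
            self.queue.append("#text")  # assumed placeholder for a data field


def parse(html):
    """Return (tag tree root, tag queue) for an HTML string."""
    builder = TreeAndQueueBuilder()
    builder.feed(html)
    builder.close()
    return builder.root, builder.queue


if __name__ == "__main__":
    root, queue = parse("<ul><li>Phone<br>128 GB</li><li>Laptop</li></ul>")
    print(queue)
    print([child.tag for child in root.children[0].children])
```

Keeping both structures is a design choice the abstract states but does not explain. One plausible reading is that the tree serves subtree matching for noise filtering, while the flat queue serves token-level comparison across pages.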
Finally, based on the method and algorithm studied, the thesis builds an experimental prototype system for automatic data extraction from template-generated web pages. The system automatically extracts the valid data from such pages on e-commerce websites, for example concrete product "list pages" and "detail pages". Both the recall and the precision of the extraction process improve considerably, and the work addresses a broad practical need and has value for wider application.
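For reference, recall and precision for an extraction task are conventionally computed by comparing the extracted items against a hand-labeled reference set. The sketch below shows that standard calculation. The function name `precision_recall` and the sample values are hypothetical, and the abstract does not report the thesis's actual figures or evaluation protocol.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) of extracted items against a reference set."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)  # items extracted and also in the reference
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical field values, not data from the thesis.
    extracted = ["Phone", "Laptop", "Camera", "Site map"]
    expected = ["Phone", "Laptop", "Camera", "Tablet", "Monitor"]
    print(precision_recall(extracted, expected))
```

Precision falls when noise such as navigation text is extracted, and recall falls when true data fields are missed.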

[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree granted]: 2009
[Classification number]: TP393.092

[References]