Rule-Based Integrated Crawling and Extraction for Forums

Published: 2018-03-10 18:01

Topic: Web data management　Focus: data crawling　Source: East China Normal University, 2011 master's thesis　Document type: degree thesis

[Abstract]: In recent years, forums, blogs, and microblogs have appeared in succession among Internet applications and have gradually become the main way people publish information online. Among them, forums have become an important platform for publishing, sharing, and disseminating information. Forum content is created and published by ordinary users, and it is valuable for applications such as public opinion analysis and Internet advertising recommendation. Data crawling (Data Crawler) is a prerequisite for data analysis and application. Traditional crawling techniques fetch the Web page by page and defer data processing and analysis until after crawling. This approach is ill-suited to forum data, for two main reasons. First, forum data is strongly structured, yet a traditional crawler fetches data one page at a time and ignores the internal structure of forum pages and the relationships between pages. Second, most of the data is hidden in the structure of the Web pages, yet a traditional crawler stores each page in full and performs no data processing on it. This thesis therefore proposes a new forum crawling method that integrates data crawling with information extraction, and designs and implements the InForCE system on that basis. The system analyzes the structure and content of forum navigation pages, uses them to schedule the crawling of post pages, and organizes and manages the crawled data according to forum content. InForCE consists of a crawler, an HTML parser, a link pool, a learner, and a rule base. The crawler fetches Web pages. The HTML parser converts HTML pages into XHTML pages suitable for information extraction. The link pool is used to determine the system's scheduling strategy. The rule learner and rule base are used to extract information from pages.

The main contributions of this thesis are summarized as follows:

1. Page crawling, structure analysis, and content extraction are combined, and crawl tasks are scheduled, and crawled data managed, by information unit rather than by page. An information unit comprises all the information belonging to one post. Forum pages are of two types: navigation pages and post pages. A navigation page lists all discussion topics. A post page shows a topic and the replies to it. The content of the navigation pages determines the crawl scheduling strategy for post pages, and all content of the same post is organized into a single document.

2. A descriptive schema mapping rule based on XML and XPath patterns is proposed and used to extract and transform forum data. An XPath pattern represents the features of a set of XPaths and is used to define schema mapping rules. A schema mapping rule expresses the data mapping from a source document (usually in XHTML format) to a target document (usually in XML format).

3. A rule learner simplifies the information extraction process. Schema mapping rules are obtained through machine learning and converted automatically into XSLT, which performs the transformation from forum pages to the final result. Because the rules are converted automatically, users without XSLT knowledge can also complete extraction tasks quickly.

In summary, we analyze the problems that arise in acquiring forum data and design the InForCE system around the characteristics of forum data. Using the Liba forum (篱笆论坛) as the experimental subject, this thesis defines a data extraction model, learns schema mapping rules, and crawls and extracts forum data. At present, InForCE runs successfully on the Liba forum and the SouFun forum (搜房论坛). It has collected 380 GB of forum pages and extracted 40 GB of forum data. Finally, experiments show that the system can crawl, extract, and organize forum data efficiently.
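The abstract does not give the syntax of the schema mapping rules or the XSLT that InForCE generates, so the following is only a minimal sketch of the general idea under stated assumptions: a declarative rule maps XPath expressions over a source page to elements of a target XML document, and every page of the same post is merged into one document (one information unit). The rule layout (`RULE`), the function names, the target element names, and the sample pages are all hypothetical, and the sketch evaluates XPath directly with Python's standard library instead of producing XSLT.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Hypothetical mapping rule (not the thesis's rule format):
# "record" selects each repeated reply node on a post page, and
# "fields" maps a target element name to an XPath relative to that node.
# ElementTree implements only a subset of XPath, which suffices here.
RULE = {
    "record": ".//div[@class='post']",
    "fields": {
        "author": "./span[@class='author']",
        "body": "./p[@class='body']",
    },
}


def extract_posts(xhtml: str, rule: dict) -> list[ET.Element]:
    """Apply the mapping rule to one well-formed page; return target elements."""
    root = ET.fromstring(xhtml)
    replies = []
    for node in root.findall(rule["record"]):
        reply = ET.Element("reply")
        for name, path in rule["fields"].items():
            found = node.find(path)
            child = ET.SubElement(reply, name)
            # A missing field becomes an empty element rather than an error.
            child.text = (found.text or "").strip() if found is not None else ""
        replies.append(reply)
    return replies


def build_thread_document(thread_id: str, pages: list[str], rule: dict) -> ET.Element:
    """Merge all pages of one post into a single XML document."""
    thread = ET.Element("thread", {"id": thread_id})
    for page in pages:
        thread.extend(extract_posts(page, rule))
    return thread


# Two made-up pages of the same post. Real XHTML declares a namespace,
# which would require namespace-qualified paths; it is omitted for brevity.
PAGE_1 = """<html><body>
<div class="post"><span class="author">alice</span><p class="body">Any tips for tiling?</p></div>
<div class="post"><span class="author">bob</span><p class="body">Seal the grout first.</p></div>
</body></html>"""

PAGE_2 = """<html><body>
<div class="post"><span class="author">carol</span><p class="body">Agreed.</p></div>
</body></html>"""

doc = build_thread_document("t-1001", [PAGE_1, PAGE_2], RULE)
print(ET.tostring(doc, encoding="unicode"))
```

Keeping the rule as plain data, separate from the code that applies it, mirrors the descriptive character of the mapping rules described above: a learner could in principle emit such a rule without the user writing any transformation code by hand.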
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree awarded]: 2011
[Classification number]: TP393.092
