Web Forum Data Extraction

Published: 2018-05-26 16:41

Topics: forum data extraction + user-generated content; source: doctoral dissertation, East China Normal University, 2012

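[Abstract (translated from the Chinese original)]: Web 2.0 offers users a rich set of applications, and the deep participation of large numbers of users is turning the Web into an ecosystem. While presenting information to users, Web 2.0 also draws them into contributing large amounts of content, and this user-generated content holds great value. As a typical Web 2.0 application, the forum gives users a platform for obtaining and exchanging information. Users publish posts and comments on forums, for example sharing experience with products, exchanging reflections on life, discussing school education, and posting social news; this content faithfully reflects users' needs, opinions, and social phenomena. Extracting forum data from Web pages to support applications such as product recommendation, expert finding, and public opinion monitoring is therefore of considerable research and practical significance. Forum data is fairly complex: it contains not only user-generated content but also noise such as recommendations and advertisements. In addition, forum sites differ widely in style, which makes forum data extraction challenging. Traditional Web data extraction techniques usually target relatively regular structured data and are not suitable for forum data, so efficient extraction techniques designed for forum data are needed. The main contributions of this dissertation are as follows:
· A forum data extraction method that integrates inductive logic programming with XPath pattern learning, achieving high precision and recall. The method takes full account of the structural features of forum pages, introduces new predicates to integrate logic program expressions with XPath patterns, and uses a divide-and-conquer approach to learn XPath patterns that describe the structural features of the target data. Finally, the learned XPath pattern rules are converted into XSLT files, so that the extracted forum data is stored according to a predefined model and forum data extraction is automated.
· An unsupervised forum data extraction method that takes full account of the structural features of Web pages and the relationships between pages, significantly raising the degree of automation. Exploiting the structural similarity of pages from the same forum site, it compares multiple pages jointly to divide a Web page into stable and unstable regions, and removes most of the noise in the page through page-level and template-level filtering. It then uses the relationship between paths in the stable regions and paths in the unstable regions, introducing path accompanying distance and similarity to compute the dependencies between paths, and thereby decides whether a path belongs to the extraction target, achieving automatic extraction of forum post content.
· An unsupervised method for generating forum data extraction rules that takes full account of both Web page structure and page content features, improving adaptability to different forums and ensuring that posts are extracted completely. It is a two-stage rule generation method that exploits the features of three things together: Web page structure, user-published posts, and the regular redundant information of forums. In the user information stage, user regions are located through the regular redundant information in Web pages, and the maximal substructure within the user regions is discovered to obtain the user information. In the post content stage, user regions are converted into records of a relational table, and the functional dependencies between attributes are used to distinguish post content from noise. Finally, the paths corresponding to the content obtained in the two stages are generalized into extraction rules represented as regular tree structures.
In summary, this dissertation proposes three forum data extraction methods that start from different requirements. The first uses supervised learning of extraction rules, achieves good precision and recall, and is best suited to small forum data sets. The second is unsupervised: it extracts data directly from Web pages without explicitly outputting extraction rules, and suits larger forum data sets. The third is also unsupervised: it first learns extraction rules and then extracts data based on them, balancing automated rule generation against extraction performance, and can handle even larger data sets. Experiments on real forum data show that these methods extract data effectively from a variety of forums.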
[Abstract]: Web 2.0 offers users a rich set of applications, and the deep participation of large numbers of users is turning the Web into an ecosystem. While presenting information to users, Web 2.0 also draws them into contributing large amounts of content, and this user-generated content holds great value.
As a typical Web 2.0 application, the forum gives users a platform for obtaining and exchanging information. Users publish posts and comments on forums, for example sharing experience with products, exchanging reflections on life, discussing school education, and posting social news; this content faithfully reflects users' needs, opinions, and social phenomena. Extracting forum data from Web pages to support applications such as product recommendation, expert finding, and public opinion monitoring is therefore of considerable research and practical significance.
Forum data is fairly complex. It contains not only user-generated content but also noise such as recommendations and advertisements. In addition, forum sites differ widely in style, which makes forum data extraction challenging. Traditional Web data extraction techniques usually target relatively regular structured data and are not suitable for forum data, so efficient extraction techniques designed for forum data are needed. The main contributions of this dissertation are as follows:
A forum data extraction method that integrates inductive logic programming with XPath pattern learning is proposed; it achieves high precision and recall. The method takes full account of the structural features of forum pages, introduces new predicates to integrate logic program expressions with XPath patterns, and uses a divide-and-conquer approach to learn XPath patterns that describe the structural features of the target data. Finally, the learned XPath pattern rules are converted into XSLT files, so that the extracted forum data is stored according to a predefined model and forum data extraction is automated.
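As a rough illustration of how XPath patterns can drive extraction, the sketch below applies a small set of patterns to a toy page. The page markup, the `RULES` dictionary, and the `extract_posts` function are all hypothetical and hand-written for this example; the dissertation learns its patterns with inductive logic programming and applies them through XSLT, neither of which is reproduced here.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed forum page (hypothetical markup, not from the dissertation).
PAGE = """
<html><body>
  <div class="ad">Buy now</div>
  <div class="post"><span class="author">alice</span><p class="content">First post</p></div>
  <div class="post"><span class="author">bob</span><p class="content">A reply</p></div>
</body></html>
"""

# Hand-written stand-ins for learned XPath patterns (assumption for illustration).
RULES = {
    "record": ".//div[@class='post']",    # one match per post record
    "author": "./span[@class='author']",  # relative to a record
    "content": "./p[@class='content']",   # relative to a record
}

def extract_posts(page_xml, rules):
    """Apply the record pattern, then the field patterns inside each record."""
    root = ET.fromstring(page_xml.strip())
    posts = []
    for record in root.findall(rules["record"]):
        posts.append({
            "author": record.findtext(rules["author"]),
            "content": record.findtext(rules["content"]),
        })
    return posts

if __name__ == "__main__":
    for post in extract_posts(PAGE, RULES):
        print(post)
```

The advertisement block is skipped simply because no pattern matches it. Real forum pages are rarely well-formed XML, so an HTML parser would be needed in practice.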
An unsupervised forum data extraction method is proposed. It takes full account of the structural features of Web pages and the relationships between pages, and significantly raises the degree of automation. Exploiting the structural similarity of pages from the same forum site, it compares multiple pages jointly to divide a Web page into stable and unstable regions, and removes most of the noise in the page through page-level and template-level filtering. It then uses the relationship between paths in the stable regions and paths in the unstable regions, introducing path accompanying distance and similarity to compute the dependencies between paths, and thereby decides whether a path belongs to the extraction target, achieving automatic extraction of forum post content.
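The idea of separating stable from unstable regions by comparing several pages of one site can be sketched as follows. This is a simplified illustration under stated assumptions: a tag path counts as stable when its text is identical on every page, and as unstable otherwise. The function names and toy pages are hypothetical, and the dissertation's page-level and template-level filtering, path accompanying distance, and similarity measure are not reproduced.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def path_texts(page_xml):
    """Map each root-to-node tag path to the set of texts found under it."""
    result = defaultdict(set)

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            result[path].add(text)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(page_xml.strip()), "")
    return result

def split_regions(pages):
    """Classify tag paths as stable (same text on all pages) or unstable."""
    per_page = [path_texts(p) for p in pages]
    all_paths = set().union(*per_page)
    stable, unstable = set(), set()
    for path in all_paths:
        texts = [page.get(path, set()) for page in per_page]
        if all(t == texts[0] for t in texts):
            stable.add(path)
        else:
            unstable.add(path)
    return stable, unstable

# Two toy pages from the same hypothetical forum site.
PAGE_A = ("<html><body><div><h1>My Forum</h1></div>"
          "<ul><li>Hello from alice</li></ul></body></html>")
PAGE_B = ("<html><body><div><h1>My Forum</h1></div>"
          "<ul><li>Question from bob</li><li>Answer from carol</li></ul></body></html>")

if __name__ == "__main__":
    stable, unstable = split_regions([PAGE_A, PAGE_B])
    print("stable:", sorted(stable))
    print("unstable:", sorted(unstable))
```

In this toy setting the site header lands in the stable set and the list items that carry post text land in the unstable set, which is where candidate extraction targets would be sought.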
An unsupervised method for generating forum data extraction rules is proposed. It takes full account of both Web page structure and page content features, improves adaptability to different forums, and ensures that posts are extracted completely. It is a two-stage rule generation method that exploits the features of three things together: Web page structure, user-published posts, and the regular redundant information of forums. In the user information stage, user regions are located through the regular redundant information in Web pages, and the maximal substructure within the user regions is discovered to obtain the user information. In the post content stage, user regions are converted into records of a relational table, and the functional dependencies between attributes are used to distinguish post content from noise. Finally, the paths corresponding to the content obtained in the two stages are generalized into extraction rules represented as regular tree structures.
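To make the notion of a functional dependency between attributes concrete, the sketch below checks whether one set of columns determines another in a small relational table. The table, the attribute names, and the reading that a field fully determined by the author (such as a signature) behaves like noise while post content does not are assumptions for illustration only; the abstract does not state which dependencies the dissertation actually uses.

```python
def holds(records, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in records."""
    seen = {}
    for record in records:
        key = tuple(record[a] for a in lhs)
        value = tuple(record[a] for a in rhs)
        # The first value seen for a key is remembered; a later mismatch breaks the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical records, one per user region on a thread page.
RECORDS = [
    {"author": "alice", "signature": "Stay curious", "content": "First post"},
    {"author": "bob", "signature": "bob's shop", "content": "A reply"},
    {"author": "alice", "signature": "Stay curious", "content": "Thanks, bob"},
]

if __name__ == "__main__":
    print(holds(RECORDS, ["author"], ["signature"]))  # signature repeats per author
    print(holds(RECORDS, ["author"], ["content"]))    # content varies per post
```

Here `author -> signature` holds while `author -> content` does not, which is one way a dependency test can separate fields that vary per post from fields that merely repeat.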
In summary, this dissertation proposes three forum data extraction methods that start from different requirements. The first uses supervised learning of extraction rules, achieves good precision and recall, and is best suited to small forum data sets. The second is unsupervised: it extracts data directly from Web pages without explicitly outputting extraction rules, and suits larger forum data sets. The third is also unsupervised: it first learns extraction rules and then extracts data based on them, balancing automated rule generation against extraction performance, and can handle even larger data sets. Experiments on real forum data show that these methods extract data effectively from a variety of forums.
[Degree-granting institution]: East China Normal University
[Degree level]: Doctoral
[Year conferred]: 2012
[Classification number]: TP393.09
