Web Page Main-Text Information Extraction Algorithms

Published: 2018-03-22 03:24

Topic: Web data mining  Focus: information extraction  Source: Guangxi Normal University, 2013 master's thesis  Document type: degree thesis

[Abstract]: With the rapid development of the Internet and database technology, the Internet has become the mainstream platform for disseminating information. The massive volume of information online brings convenience, but it also brings a series of problems: redundancy, heterogeneous formats, difficulty telling true information from false, and difficulty processing it uniformly. Phenomena such as "data surplus", "information explosion", and "lack of knowledge" make it hard for people to quickly find what they need in this mass of information, and Web data mining emerged in response. Web data mining covers three main areas: Web content mining, Web structure mining, and Web usage mining. With the development of digital media technology, Web pages are filled with cross-media information of all kinds, which makes Web content mining increasingly important, so this thesis focuses on Web content mining. As the types and volume of information on a page grow, it becomes harder to obtain the information of interest from a single page. In addition, page authors and owners embellish their pages to increase the pages' influence and serve their own interests, and they typically insert many hyperlinks, advertisements, and other "noise information", so users cannot quickly locate the information they need. Page information extraction has therefore become an important research topic in content mining, and it matters especially for mobile phone and PAD users. A survey of existing work shows that current Web information extraction methods fall into four main groups: statistical-learning-based, template-based, DOM-tree-based, and visual-information-based. This thesis compares them in three respects and analyzes the strengths and weaknesses of each. On this basis, it proposes two methods for extracting the main text of Web pages.

(1) Web page main-text extraction based on Block-DOM. Template-based, visual-information-based, and DOM-tree-based extraction are current research hotspots. This thesis combines the strengths of the three and proposes a Block-DOM-based method for extracting the main text of Web pages. The method simplifies the underlying techniques: the page to be processed is first cleaned, parsed, classified, segmented into blocks, and purified, and the main text is then extracted. Experiments show that the method is fast and accurate and is reasonably effective.

(2) Web page main-text extraction based on blocks and tag usage. This thesis proposes a method for extracting the main text of Web pages based on blocks and tag usage. Building on the DOM tree and the VIPS algorithm, it summarizes rules that use blocks and tag usage to extract the main text of a page, and it designs a noise-word filter that removes text such as user comments and messages from the page. A simulated browser was also built for the experiments, with four modules: a parsing module, a block segmentation module, a text extraction module, and a noise-word filtering module. Experiments show that the method extracts the topic information accurately and efficiently.
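The stages named in the abstract (clean, parse, segment into blocks, purify, extract, filter noise words) can be illustrated with a minimal Python sketch. This is not the thesis's Block-DOM or VIPS-based implementation, whose rules the abstract does not give. It swaps in a common link-density heuristic, and the tag sets, thresholds, `NOISE_WORDS` list, and the names `BlockParser` and `extract_main_text` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets and noise words; the abstract does not list the actual rules.
SKIP_TAGS = {"script", "style", "noscript"}          # "cleaning" stage
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "h1", "h2", "h3"}
NOISE_WORDS = ("comment", "reply", "leave a message")  # hypothetical list


class BlockParser(HTMLParser):
    """Split a page into text blocks and count the characters inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks as (text, link_chars)
        self._text = []       # text pieces of the current block
        self._link_chars = 0  # characters inside <a> in the current block
        self._skip = 0        # nesting depth inside SKIP_TAGS
        self._in_link = 0     # nesting depth inside <a>

    def _flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_chars=40, max_link_density=0.5):
    """Keep blocks that are long, mostly not link text, and free of noise words."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text) < min_chars:
            continue  # too short to be main text
        if link_chars / len(text) > max_link_density:
            continue  # "purify": navigation bars and link lists
        if any(word in text.lower() for word in NOISE_WORDS):
            continue  # crude noise-word filter for comments and messages
        kept.append(text)
    return "\n".join(kept)
```

Calling `extract_main_text(html_string)` returns the surviving blocks joined by newlines. The substring-based noise filter is deliberately crude and would also drop article paragraphs that happen to contain a listed word. A real filter would need the block's position or tag context as well.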
[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1

[References]