Research on DOM-Based Web Page Purification Methods
Published: 2018-12-15 00:34
[Abstract]: The Internet has become the most important repository of information. Web pages, however, contain large amounts of material unrelated to the content a reader cares about, such as navigation bars, advertisements, copyright notices, and survey questionnaires. This irrelevant content seriously degrades the results of Web information mining. Web page purification aims to make cluttered page content clear, structured, and organized, and to remove the irrelevant parts. It has become a key technology for Web information mining.

This thesis introduces the technologies related to web page purification and their role in Web information mining, examines the currently popular web page segmentation models, and analyzes their strengths and weaknesses. Commercial web pages are now commonly designed in the "DIV plus CSS" style, in which designers deliberately place logically related information in the same DIV tag and control the layout with style sheets. Based on this observation, the thesis proposes a new web page segmentation model, DSS_DOM. The model identifies the basic data units in a page and partitions the whole page into logical regions. The thesis then studies a purification algorithm built on the DSS_DOM model: it analyzes the characteristics of web page noise, derives a set of evaluation criteria, assigns weights to judge the importance of each logical region, and identifies topic regions and noise regions, thereby purifying the page.

The open-source project Lucene was used to index the purified page collection, and search was implemented on top of the purified pages. Experiments show that the DSS_DOM model and its algorithm reduce the size of the Lucene index and improve Lucene's precision. The model and algorithm were also applied to the CPCK Chinese web page classifier to implement automatic classification of Chinese web pages on top of purification. The experimental results show that the DSS_DOM model and its algorithm make the topic and category of each page clearer and improve the accuracy of web page classification.
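The idea of splitting a page into DIV regions and weighting each region can be illustrated with a minimal Python sketch. The abstract does not state the thesis's actual evaluation criteria or weights, so everything specific here is an assumption made for illustration: the link-density score, the threshold of 20, and the names `DivRegionParser`, `score`, `split_regions`, and `PAGE`. It is not the DSS_DOM algorithm itself.

```python
from html.parser import HTMLParser


class DivRegionParser(HTMLParser):
    """Collect text length and link-text length for each DIV region."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open DIV regions, innermost last
        self.regions = []    # closed regions, in closing order
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            region_id = dict(attrs).get("id") or ""
            self.stack.append({"id": region_id, "text": 0, "link": 0})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.regions.append(self.stack.pop())
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            region = self.stack[-1]  # credit text to the innermost DIV only
            region["text"] += n
            if self.link_depth:
                region["link"] += n


def score(region):
    """Assumed weight: amount of text, discounted by the share inside links."""
    if region["text"] == 0:
        return 0.0
    link_density = region["link"] / region["text"]
    return region["text"] * (1.0 - link_density)


def split_regions(html, threshold=20.0):
    """Return (topic_ids, noise_ids) using the assumed score and threshold."""
    parser = DivRegionParser()
    parser.feed(html)
    parser.close()
    topic = [r["id"] for r in parser.regions if score(r) >= threshold]
    noise = [r["id"] for r in parser.regions if score(r) < threshold]
    return topic, noise


# Hypothetical sample page: a link-only navigation bar, a text body, a footer.
PAGE = """
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div id="content"><p>Web page purification removes navigation bars, advertisements and
copyright notices so that only the main topic text is indexed.</p></div>
<div id="footer"><a href="/copyright">Copyright notice</a></div>
"""

if __name__ == "__main__":
    topic, noise = split_regions(PAGE)
    print(topic, noise)  # -> ['content'] ['nav', 'footer']
```

Regions whose text is mostly link text, such as the navigation bar and the footer, score near zero and are treated as noise. A real system would presumably combine several criteria rather than this single one.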
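[Degree-granting institution]: China University of Petroleum
[Degree level]: Master's
[Year degree awarded]: 2009
[Classification number]: TP393.092
Document ID: 2379606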
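[Citing literature]
Related journal articles (top 1)
1 邹永强; 钟志农. An Efficient Noise Filtering Method for News Web Pages [J]. Microcomputer & Its Applications, 2011, No. 16
Related master's theses (top 7)
1 王乐超. Research on Extraction and Matching of Literature Information in the Web Environment [D]. Dalian University of Technology, 2010
2 邹永强. Research on Extracting Person Entity Relations from News Web Pages [D]. National University of Defense Technology, 2011
3 罗黎敏. Design and Implementation of a Web Page Purification System Based on the DOM Model [D]. Hunan University, 2010
4 白玉昭. Research and Implementation of a Vertical Search Engine [D]. Jiangnan University, 2012
5 方加沛. Research on Key Technologies of Vertical Search Engines [D]. Jinan University, 2010
6 陈佳佳. Research on Deep Web Data Integration and Its Application in the Book-Purchasing Domain [D]. Jinan University, 2010
7 莫卓颖. Web Information Extraction Based on Semantic DOM [D]. Guangxi Normal University, 2012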
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/2379606.html