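Research on Denoising Topic-Oriented Web Pages Based on an Improved DOM Tree
Published: 2018-06-22 07:31
Topics: topic-oriented web pages + DOM tree; Source: Southwest University, master's thesis, 2017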
[Abstract]: With the rapid growth of the Internet, the volume of web page data on the Web increases by the day. The data on a typical web page can be divided into two parts, content blocks and noise blocks. Noise blocks mainly comprise the navigation bars at the top or side of the page, the advertising banners around its edges, and the copyright information at the bottom. Noise data accounts for almost half of a web page, and that share continues to grow. This growth not only makes it harder for users to obtain topic-related information but also lowers the efficiency with which they search for useful information, so quickly removing noise that is unrelated to the page topic is particularly important.

Web page denoising methods generally fall into three categories: methods based on page templates, methods based on visual information, and methods based on the DOM tree. This thesis denoises topic-oriented web pages mainly on the basis of the DOM tree structure. In earlier DOM-tree-based work, researchers mostly first divided DOM tree nodes into types according to preset rules and then used the node type to decide which nodes are noise. Assigning node types this early on the basis of a single factor can misclassify nodes and degrade the subsequent denoising. In addition, an analysis of the second-level detail pages of several major Chinese portal sites shows that topic-oriented pages have a prominent topic, a large amount of text, and few images and links. To address the shortcomings of earlier DOM-tree-based research, and drawing on the structural, textual, and tag-semantic characteristics of topic-oriented pages, this thesis builds an improved DOM tree model on top of the traditional DOM tree and gives a denoising method for topic-oriented web pages based on that model. The main contents of the research are as follows:

(1) HTML tags are divided into topic-block tags and non-topic-block tags according to their relevance to the topic and the granularity of node division. Taking into account the semantic relevance between a tag and the topic, the link feature value within a node, the text length within a node, the number of pure-text child nodes within a node, and the number of images within a node, the custom attributes tagDeg, linkVal, textLen, textNum, and picNum are added to each Node in turn while the DOM tree is built.
(2) An improved DOM tree model is proposed. The HTML document is first parsed into a DOM tree, and the tree is then traversed to add the custom attributes to each node. When non-topic-block nodes in the DOM are merged, the values of the newly added attributes tagDeg and linkVal are accumulated as well. The result is an improved DOM tree model that contains only topic-block nodes.
(3) A web page denoising method based on the improved DOM tree model is given. The method consists of web page preprocessing, construction of the improved DOM tree model, and denoising on the improved DOM tree. In the denoising step, the custom attribute values of each node are compared with preset thresholds to identify and delete noise nodes.

Finally, experimental analysis shows that the method denoises topic-oriented web pages well.
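The pipeline described in the abstract (annotate nodes, merge non-topic-block nodes, then prune by threshold) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation. The tag set `TOPIC_BLOCK_TAGS`, the weights in `TAG_DEG`, the threshold `MAX_LINK_RATIO`, the reading of linkVal as "characters of linked text", and all function and class names are assumptions, because the abstract gives no concrete values or formulas. The sketch also assumes the page has already been preprocessed, with scripts and styles removed. Only linkVal and textLen enter the simplified noise rule here; tagDeg, textNum, and picNum are recorded but unused.

```python
from html.parser import HTMLParser

# Hypothetical values: the abstract names these concepts but gives no lists or thresholds.
TOPIC_BLOCK_TAGS = {"html", "body", "div", "p", "table", "tr", "td", "ul", "li"}
TAG_DEG = {"h1": 1.0, "h2": 1.0, "strong": 0.5, "a": -1.0, "img": -0.5}
MAX_LINK_RATIO = 0.5


class Node:
    """A topic-block node carrying the five custom attributes named in the abstract."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.texts = tag, parent, [], []
        self.tagDeg = TAG_DEG.get(tag, 0.0)  # tag-to-topic relevance (assumed weights)
        self.linkVal = 0   # characters of text inside <a> elements (assumed definition)
        self.textLen = 0   # characters of text owned directly by this block
        self.textNum = 0   # number of non-empty text runs
        self.picNum = 0    # number of <img> elements


class ImprovedDomBuilder(HTMLParser):
    """Builds a tree of topic-block nodes; other tags are merged into their enclosing block."""
    def __init__(self):
        super().__init__()
        self.root = self.current = Node("root")
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in TOPIC_BLOCK_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node
            return
        # Non-topic-block tag: it gets no node; its values accumulate in the enclosing block.
        self.current.tagDeg += TAG_DEG.get(tag, 0.0)
        if tag == "a":
            self.link_depth += 1
        elif tag == "img":
            self.current.picNum += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in TOPIC_BLOCK_TAGS:
            # Close the nearest open block with this tag, tolerating unclosed children.
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.current.texts.append(text)
        self.current.textLen += len(text)
        self.current.textNum += 1
        if self.link_depth:
            self.current.linkVal += len(text)


def build_tree(html):
    builder = ImprovedDomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def subtree_stats(node):
    """Return (textLen, linkVal) summed over a node and all of its descendants."""
    text, link = node.textLen, node.linkVal
    for child in node.children:
        child_text, child_link = subtree_stats(child)
        text, link = text + child_text, link + child_link
    return text, link


def is_noise(node):
    """Simplified rule: a block with no text, or mostly linked text, is treated as noise."""
    text, link = subtree_stats(node)
    return text == 0 or link / text > MAX_LINK_RATIO


def denoise(node):
    """Drop noisy child blocks in place, then recurse into the surviving blocks."""
    node.children = [c for c in node.children
                     if c.tag in ("html", "body") or not is_noise(c)]
    for child in node.children:
        denoise(child)


def collect_text(node):
    """Gather the remaining text, a block's own text first, then its children's."""
    parts = list(node.texts)
    for child in node.children:
        parts.extend(collect_text(child))
    return parts
```

Calling `build_tree(html)`, then `denoise(root)`, then `collect_text(root)` returns the text of the blocks that survive pruning. Merging inline tags such as `a`, `h1`, and `img` into their enclosing block, instead of classifying each node up front, mirrors the abstract's point that typing nodes too early on a single factor can misclassify them. A fuller version would let all five attributes contribute to the threshold test.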
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092
Article ID: 2052092
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2052092.html