Research on Denoising Topic-Oriented Web Pages Based on an Improved DOM Tree

Published: 2018-06-22 07:31

Thesis topic: topic-oriented web pages + DOM tree; source: Southwest University, master's thesis, 2017

[Abstract]: With the rapid growth of the Internet, the volume of web page data on the Web is increasing daily. The data on a typical web page can generally be divided into two parts: content blocks and noise blocks. Noise blocks mainly include navigation bars at the top or side of the page, advertising banners around its edges, and copyright information at the bottom. Noise data accounts for almost half of a web page, and this proportion continues to grow. The continued growth of noise data not only makes it harder for users to obtain topic-related information but also lowers the efficiency with which they search for useful information, so quickly removing noise that is unrelated to the page topic is particularly important. Web page denoising methods generally fall into three categories: methods based on page templates, methods based on visual information, and methods based on the DOM tree. This thesis denoises topic-oriented web pages mainly on the basis of the DOM tree structure. In previous DOM-tree-based denoising research, most researchers first divided DOM tree nodes into types according to predefined rules and then judged which nodes were noise according to node type. However, assigning a node type prematurely on the basis of a single factor may cause the type to be misjudged, which degrades the subsequent denoising. In addition, by analyzing the second-level detail pages of several major Chinese portal sites, this thesis finds that topic-oriented pages have a prominent topic, a large amount of text, and relatively few images and links. To address the shortcomings of previous DOM-tree-based research, and drawing on the structural, textual, and tag-semantic characteristics of topic-oriented pages, this thesis builds an improved DOM tree model on top of the traditional DOM tree and, based on that model, presents a denoising method for topic-oriented web pages. The main contents of the research are as follows:
(1) HTML tags are divided into topic block tags and non-topic block tags according to their relevance to the topic and the granularity of node partitioning. Taking into account the semantic relevance between a tag and the topic, the link feature value within a node, the text length within a node, the number of plain-text child nodes within a node, and the number of images within a node, the custom attributes tagDeg, linkVal, textLen, textNum, and picNum are added in turn to each Node while the DOM tree is built.
(2) An improved DOM tree model is proposed. The HTML document is first parsed into a DOM tree, and the tree is then traversed to add the custom attributes to its nodes in turn. When non-topic block nodes in the DOM are merged, the values of the newly added attributes tagDeg and linkVal are accumulated as well. The result is an improved DOM tree model that contains only topic block nodes.
(3) A web page denoising method based on the improved DOM tree model is presented. The method consists mainly of web page preprocessing, construction of the improved DOM tree model, and denoising on the improved DOM tree. In the denoising step, the custom attribute values of each node are compared with preset thresholds to identify and delete noise nodes.
Finally, experimental analysis shows that the method denoises topic-oriented web pages well.
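The three steps in the abstract (preprocessing, building the improved DOM tree, and threshold-based denoising) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the attributes are computed or which thresholds are used, so the tag lists, the `TAG_WEIGHT` table, the definition of `linkVal` as the number of characters inside links, the noise rule in `is_noise`, and every function name below are assumptions made for illustration. Only the five attribute names come from the abstract. The parser also assumes reasonably well-formed HTML with closed block tags.

```python
from html.parser import HTMLParser

# Assumed tag split; the abstract does not list the thesis's actual tags.
BLOCK_TAGS = {"body", "div", "p", "ul", "ol", "li", "table", "tr", "td",
              "section", "article", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}  # dropped as a preprocessing step
# Hypothetical tag-to-topic relevance weights used for tagDeg.
TAG_WEIGHT = {"p": 1.0, "h1": 1.0, "h2": 0.8, "h3": 0.8, "a": -0.5}


class Node:
    """A topic block node carrying the five custom attributes."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.texts = []                         # text owned by this block
        self.tagDeg = TAG_WEIGHT.get(tag, 0.0)  # tag/topic relevance (assumed weights)
        self.linkVal = 0   # characters of text inside <a> (assumed definition)
        self.textLen = 0   # characters of text in this block
        self.textNum = 0   # number of non-empty text nodes
        self.picNum = 0    # number of <img> elements


class ImprovedDomBuilder(HTMLParser):
    """Builds a tree of block nodes only; other tags are merged into the
    enclosing block and their attribute values are accumulated there."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("root")
        self.link_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node
        else:  # non-topic block tag: merge into the enclosing block
            self.current.tagDeg += TAG_WEIGHT.get(tag, 0.0)
            if tag == "a":
                self.link_depth += 1
            elif tag == "img":
                self.current.picNum += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS:
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        self.current.texts.append(text)
        self.current.textLen += len(text)
        self.current.textNum += 1
        if self.link_depth:
            self.current.linkVal += len(text)


def build_improved_dom(html):
    builder = ImprovedDomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def subtree_totals(node):
    """Sum textLen and linkVal over a node and all of its descendants."""
    text, link = node.textLen, node.linkVal
    for child in node.children:
        t, l = subtree_totals(child)
        text, link = text + t, link + l
    return text, link


def is_noise(node, max_link_ratio=0.5, min_text=20):
    """Assumed rule: empty, link-dominated, or short low-relevance blocks."""
    text, link = subtree_totals(node)
    if text == 0:
        return True
    return link / text > max_link_ratio or (text < min_text and node.tagDeg <= 0)


def denoise(node):
    """Delete noise children top-down, then recurse into the survivors."""
    node.children = [c for c in node.children if not is_noise(c)]
    for child in node.children:
        denoise(child)


def collect_text(node):
    parts = list(node.texts) + [collect_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


SAMPLE = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>
<div><h1>DOM trees</h1>
<p>Topic pages carry long runs of plain text with few links or pictures.</p></div>
<div>Copyright 2017</div>
<script>var x = "tracking";</script>
</body></html>"""

root = build_improved_dom(SAMPLE)
denoise(root)
print(collect_text(root))
```

In this sketch the navigation list is removed because all of its text sits inside links, and the copyright block is removed because it is short and carries no positive tag weight, while the heading survives on its tag weight despite being short. Comparing several attributes at deletion time, rather than fixing a node type early from a single factor, follows the motivation stated in the abstract, though the specific thresholds here are placeholders.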
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092

[References]