Research on a Web Page Main-Content Extraction Algorithm Based on Visual Semantic Blocks

Published: 2018-06-01 09:44

Topic: web page main-content extraction + DOM tree; Source: Zhejiang University, 2013 master's thesis

[Abstract]: With the rapid development of Internet technology and the explosive growth of online information, the number of web pages has risen sharply, and people increasingly rely on search engines to find the information they need on the Internet. A web page, however, usually contains more than the main content the user wants: it also carries non-content elements such as navigation bars, advertising links, and recommended links. This noise significantly degrades the efficiency and accuracy of search engines, so main-content extraction from web pages has become an important problem in the search engine field. This thesis proposes a main-content extraction algorithm based on visual semantic blocks. Unlike the existing mainstream extraction algorithms, it does not depend on the text of the page. Instead, it starts from the user's visual perspective: it segments the page into semantic blocks according to semantic features, finds the block with the largest area, then searches for blocks whose structure is similar to it, and repeats this search until the main content of the page has been extracted. On the one hand, because the algorithm does not rely on the density distribution of page text, it performs well even on pages whose noise also contains a large amount of text, and it extracts the images, videos, and other media embedded in the main content as well, which makes it more robust. On the other hand, when processing the DOM tree the algorithm does not need to traverse the whole tree to find the target information; it only needs to process the leaf nodes, which saves search time and greatly improves extraction efficiency.
The thesis reports an experimental analysis of 300 web pages from 15 portal sites, covering topical pages such as news, blogs, forums, and BBS. The results show that the proposed algorithm generally achieves extraction precision and recall above 94%. Because it approaches the problem from a different angle, it can also be combined with traditional text-based algorithms to obtain better results.
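The abstract gives only the outline of the method: pick the semantic block with the largest area, then repeatedly collect blocks that are structurally similar to it. The Python sketch below illustrates that loop under stated assumptions and is not the thesis's actual implementation. The `Block` fields, the `similar` test (same DOM tag path and widths within 20% of each other), and the function name `extract_main_blocks` are all hypothetical choices made for illustration; the thesis may define blocks and similarity quite differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """A visual semantic block. All fields are assumptions, not from the thesis."""
    tag_path: str   # DOM path of the block's container, e.g. "html/body/div/p"
    x: int          # left edge of the rendered bounding box, in pixels
    y: int          # top edge of the rendered bounding box, in pixels
    width: int
    height: int
    text: str = ""

    @property
    def area(self) -> int:
        return self.width * self.height


def similar(a: Block, b: Block, width_tol: float = 0.2) -> bool:
    """Assumed similarity test: identical tag path and widths within width_tol."""
    if a.tag_path != b.tag_path:
        return False
    wider = max(a.width, b.width)
    return wider > 0 and abs(a.width - b.width) / wider <= width_tol


def extract_main_blocks(blocks: List[Block]) -> List[Block]:
    """Seed with the largest-area block, then grow the selection until stable."""
    if not blocks:
        return []
    seed = max(blocks, key=lambda b: b.area)
    selected = [seed]
    remaining = [b for b in blocks if b is not seed]
    changed = True
    while changed:  # repeat until a full pass adds no new block
        changed = False
        for cand in list(remaining):
            if any(similar(cand, s) for s in selected):
                selected.append(cand)
                remaining.remove(cand)
                changed = True
    # Return the blocks in visual reading order (top to bottom, left to right).
    return sorted(selected, key=lambda b: (b.y, b.x))


if __name__ == "__main__":
    page = [
        Block("html/body/div/ul/li", 0, 0, 960, 40, "Home | News | Blog"),
        Block("html/body/div/div/p", 20, 60, 640, 300, "First paragraph"),
        Block("html/body/div/div/p", 20, 370, 640, 120, "Second paragraph"),
        Block("html/body/div/div/p", 20, 500, 600, 80, "Third paragraph"),
        Block("html/body/div/aside/a", 700, 60, 240, 200, "Advertisement"),
    ]
    for block in extract_main_blocks(page):
        print(block.text)
```

In this toy page the first paragraph has the largest area and becomes the seed; the other two paragraphs share its tag path and have comparable widths, so they are collected, while the navigation bar and the advertisement are left out. Because the selection depends on geometry and structure rather than text density, an image or video block with the same structure would be collected in the same way.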
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1
