Design and Implementation of a Web Page Topic Information Extraction System

Published: 2018-05-01 20:42

Topics: web page topic information extraction + web page preprocessing; Source: Harbin Institute of Technology, 2012 master's thesis

[Abstract]: With the explosive growth of online information, the Internet has become an important source of information in daily life. Because the volume of information is so large, finding things by hand has become increasingly difficult, and search engines have become an indispensable everyday tool. In essence, a search engine uses information to find information, and the first step in using information is understanding it. Most of the information a search engine works with, however, comes from web pages that contain a large amount of noise, so extracting information from web pages has become a key concern for search engine practitioners.
This thesis implements a general-purpose method for extracting topic information from web pages. Because many pages on the Internet today are not strictly well-formed, the method first preprocesses each page: it identifies the file type, handles character encoding, extracts scripts, and applies fault-tolerant parsing and cleaning. Because existing web page topic information extraction systems do not exploit the structural and visual features of the page itself, the thesis proposes an extraction algorithm that uses visual information and semantic features. The algorithm parses the semi-structured web page into a structured DOM (Document Object Model) tree and, at the same time, parses the CSS (Cascading Style Sheets) information, which it uses to annotate ("color") the DOM tree nodes, producing a DOM tree that carries visual information. The VIPS (Vision-Based Page Segmentation) algorithm then partitions the page into a hierarchical content tree whose blocks each carry their own semantics. Next, the content blocks are clustered hierarchically so that neighboring blocks are merged into the same class, yielding a set of clusters. Finally, each block is given a topic-relevance score based on its structural and semantic features, and the topic information is extracted and output according to a preset threshold.
Experiments on Chinese web pages show that the F-measure reaches 0.93 for Chinese news pages and 0.84 for general Chinese web pages. These results indicate that the method basically meets the requirements of practical use.
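
The final step of the abstract (score each content block, keep those above a preset threshold) and the reported F value can be illustrated with a minimal Python sketch. The abstract does not state which features or weights the thesis uses, so everything here is an assumption made for illustration: the `Block` structure, the link-density and text-length features, the 200-character saturation point, and the 0.5 threshold are hypothetical and are not the author's actual scoring rule. Only `f_measure` follows a standard definition, the harmonic mean of precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A content block from page segmentation (hypothetical structure)."""
    text: str       # all visible text in the block
    link_text: str  # the portion of `text` that sits inside <a> elements


def topic_score(block: Block) -> float:
    """Assumed scoring rule: long blocks with little link text score high."""
    total = len(block.text)
    if total == 0:
        return 0.0
    link_density = len(block.link_text) / total
    # Saturating length term: 200 or more characters counts as long enough.
    # The value 200 is an illustrative assumption.
    length_term = min(total / 200.0, 1.0)
    return length_term * (1.0 - link_density)


def extract_topic_blocks(blocks, threshold=0.5):
    """Keep the blocks whose score reaches the preset threshold."""
    return [b for b in blocks if topic_score(b) >= threshold]


def f_measure(extracted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over extracted items."""
    hits = len(extracted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(extracted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a navigation bar whose text is entirely link text scores 0 and is dropped, while a 300-character block with no links scores 1 and is kept. In a real system the score would presumably combine the structural and semantic features mentioned in the abstract, and the threshold would be tuned on labeled pages.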

[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3;TP393.092
