Research on Key Technologies of Information Extraction Based on Web Page Structure

Published: 2018-03-18 04:03

Topic: search engines　Focus: topic pages　Source: South China University of Technology (华南理工大学), 2011 master's thesis　Type: degree thesis

[Abstract]: The Internet has become an important source of information in daily life, and as online information grows rapidly, finding what a user wants in that mass of data is a major challenge. Search engines have gone a long way toward solving this problem. However, most information on the web is published in HTML, a semi-structured language that organizes content with predefined tags, and only a few of those tags carry information about the content itself.

Although HTML pages vary enormously, two types have very distinct characteristics: topic pages and non-topic pages. A non-topic page contains a large number of links and has no single unifying topic; Internet portals and their secondary sites are typical of this type. A topic page has a central topic, and its layout can be divided into parts such as navigation, topic content, copyright information, and advertisements; news pages are typical of this type.

This thesis designs a new page segmentation method for topic pages. The method uses the structural tags of the page as the basis for segmentation and defines a number of segmentation rules. Compared with the original segmentation method of the 木棉 (Mumian) system, the new method introduces a temporary block pool so that small blocks lying between other blocks can be merged into one larger block, which keeps the segmentation from becoming too fine-grained. The new method also introduces rules for determining the type of each block. Blocks fall into four types: link blocks, footer blocks, noise blocks, and topic blocks. Only topic blocks are kept; blocks of the other types are discarded because they carry little information.

Building on this segmentation, the thesis designs and implements new information extraction methods for web pages on the South China University of Technology campus network. They extract the following items from campus pages: page title, page publication time, page description image, and page body text. The original system already extracted the first three items but did not use the topic information of the page, so what it extracted was incomplete or, for some items, inaccurate. The new methods make full use of the topic information of the page and improve extraction accuracy. They also add extraction of the page body text, which can be used for text summarization.

Finally, the thesis evaluates the basic properties of web pages, the segmentation method, and the extraction methods in three parts: tests of page properties, a performance comparison of segmentation methods, and application results of information extraction. For the last part, the extraction methods are applied in the 木棉 (Mumian) retrieval system and the results of the original and new methods are compared. The test data set consists of South China University of Technology campus pages together with topic pages and non-topic pages from 9 Internet portal sites.
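The abstract names the steps of the segmentation method (cut on structural tags, collect small blocks in a temporary pool and merge them, classify each block, keep only topic blocks) but not the concrete rules. The following is a minimal Python sketch of that pipeline under stated assumptions, not the thesis's implementation: the tag set `BLOCK_TAGS`, the thresholds, the footer keywords, and all class and function names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# All values below are illustrative assumptions, not rules from the thesis.
BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "ol", "p", "h1", "h2", "h3"}
MIN_BLOCK_CHARS = 40     # shorter blocks go to the temporary pool
FOOTER_MAX_CHARS = 200   # footer keywords only count in short blocks
LINK_RATIO_LIMIT = 0.5   # share of characters inside <a> for a link block
FOOTER_WORDS = ("copyright", "©", "版权")


@dataclass
class Block:
    text: str = ""
    link_chars: int = 0  # number of characters that sit inside <a> elements


class Segmenter(HTMLParser):
    """Cuts the page text into blocks at every structural tag boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.current = Block()
        self.in_link = 0
        self.skip = 0  # depth inside <script>/<style>

    def cut(self):
        if self.current.text:
            self.blocks.append(self.current)
        self.current = Block()

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.cut()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BLOCK_TAGS:
            self.cut()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if self.skip or not text:
            return
        self.current.text = (self.current.text + " " + text).strip()
        if self.in_link:
            self.current.link_chars += len(text)


def segment(html):
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    parser.cut()  # flush any trailing text
    return parser.blocks


def merge_small_blocks(blocks):
    """Pool consecutive small blocks and emit each run as one merged block."""
    merged, pool = [], []

    def flush():
        if pool:
            merged.append(Block(" ".join(b.text for b in pool),
                                sum(b.link_chars for b in pool)))
            pool.clear()

    for block in blocks:
        if len(block.text) < MIN_BLOCK_CHARS:
            pool.append(block)
        else:
            flush()
            merged.append(block)
    flush()
    return merged


def classify(block):
    """Return one of the four block types named in the abstract."""
    text = block.text
    if len(text) < FOOTER_MAX_CHARS and any(w in text.lower() for w in FOOTER_WORDS):
        return "footer"
    if block.link_chars / max(len(text), 1) > LINK_RATIO_LIMIT:
        return "link"
    if len(text) < MIN_BLOCK_CHARS:
        return "noise"
    return "topic"


def extract_topic_blocks(html):
    return [b.text for b in merge_small_blocks(segment(html)) if classify(b) == "topic"]


SAMPLE_HTML = """<html><body>
<div><a href="/">Home</a> <a href="/n">Campus news</a> <a href="/d">Departments</a>
<a href="/a">Admissions</a> <a href="/c">Contact us</a></div>
<div><p>The library extends its hours.</p><p>Seats can be reserved online.</p>
<p>Bring your student card.</p></div>
<div>Copyright 2011 Example University. All rights reserved.</div>
</body></html>"""

if __name__ == "__main__":
    for text in extract_topic_blocks(SAMPLE_HTML):
        print(text)
```

In this sketch the three short paragraphs of the sample page are each too small to stand alone, so they pass through the pool and come out as a single topic block, while the navigation bar and the copyright line are classified as link and footer blocks and dropped. A real rule set would need tuning against actual pages; for example, a short body paragraph directly before a short footer would be merged into the footer here.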
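[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP393.092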
