Research and Implementation of Intelligent Web Information Extraction Technology
Published: 2018-01-14 08:37
Source: University of Electronic Science and Technology of China (电子科技大学), master's thesis, 2009. Document type: degree thesis
Keywords: information extraction; rule generator; template generator; incremental/multi-page processing
[Abstract]: With China's rapid economic development, increased investment in national information infrastructure, and rising living standards, the Internet has reached every aspect of daily life and become an indispensable part of work and life. How to obtain information from the Web quickly and effectively has therefore become an important research topic. However, information on the Web is highly varied, page structures differ widely, and most pages also contain noise such as advertisements, navigation bars, and hot links, all of which make the task difficult for researchers. Current information extraction techniques still have many shortcomings: they can handle only one type of page, they extract information at a coarse granularity, they trade accuracy against efficiency and manual intervention against automation, and they do not support incremental processing. A new extraction method is urgently needed to address these problems, and this work was undertaken in response to that need.

This thesis adopts a template-based extraction algorithm. A rule generator first identifies the delimiters of the target entities on a page; a template generator then writes these delimiter markers into a template; finally, an information extractor uses the template to extract the relevant information from the site. The main contributions and key techniques are as follows:

1. By analyzing the page structure of sites, including page layout and the distribution of tags, and drawing on existing information extraction techniques in China and abroad, a template format that can describe any page structure was devised, together with a scheme for configuring templates automatically.
2. An information extractor was designed. It reads a template and extracts information according to the template configuration. Two algorithms were added to this process. The incremental/multi-page algorithm handles content for one topic that is spread across several pages, which requires merging, and topic pages whose content is updated over time, which requires incremental extraction. The deduplication algorithm handles similar or identical topics repeated across sites.
3. Results are stored in structured form: the relevant information is extracted according to the template configuration and saved in a structured format. The extraction system is dynamically extensible: templates can be configured for different needs without changing code.

This thesis proposes a template-based information extraction scheme that can automatically extract information from pages of various types, and the corresponding system, IWIES, was developed. Practical results show that, compared with common Web information extraction methods, the scheme achieves faster extraction and higher accuracy and recall.
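The pipeline described in the abstract (delimiters stored in a template, an extractor driven by that template, merging of multi-page topics, and skipping of records already stored) can be illustrated with a minimal sketch. The abstract does not give the template format or the algorithms used in IWIES, so every name here (`FieldRule`, `Extractor`, `extract_topic`, `store_if_new`) is hypothetical, the delimiters are plain literal strings, and exact-match hashing stands in for the thesis's deduplication of similar topics.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class FieldRule:
    """Hypothetical template entry: a field name and the delimiters around its value."""
    name: str
    start: str  # literal text that precedes the value
    end: str    # literal text that follows the value


@dataclass
class Extractor:
    template: list[FieldRule]
    seen: set[str] = field(default_factory=set)  # fingerprints of stored records

    def extract(self, html: str) -> dict[str, str]:
        """Apply every rule in the template to a single page."""
        record: dict[str, str] = {}
        for rule in self.template:
            pattern = re.escape(rule.start) + r"(.*?)" + re.escape(rule.end)
            match = re.search(pattern, html, flags=re.DOTALL)
            if match:
                record[rule.name] = match.group(1).strip()
        return record

    def extract_topic(self, pages: list[str]) -> dict[str, str]:
        """Merge the pages of one topic; differing values are joined in page order."""
        merged: dict[str, str] = {}
        for page in pages:
            for name, value in self.extract(page).items():
                if name not in merged:
                    merged[name] = value
                elif value != merged[name]:
                    merged[name] += "\n" + value
        return merged

    def store_if_new(self, record: dict[str, str], store: list[dict[str, str]]) -> bool:
        """Append the record unless an identical one was stored before."""
        payload = repr(sorted(record.items())).encode("utf-8")
        fingerprint = hashlib.sha256(payload).hexdigest()
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        store.append(record)
        return True


# Assumed example template and pages; none of these come from the thesis.
template = [
    FieldRule("title", "<h1>", "</h1>"),
    FieldRule("body", '<div class="content">', "</div>"),
]
page1 = '<h1>Example</h1><div class="nav">ads</div><div class="content">Part one.</div>'
page2 = '<h1>Example</h1><div class="content">Part two.</div>'

extractor = Extractor(template)
store: list[dict[str, str]] = []
record = extractor.extract_topic([page1, page2])
extractor.store_if_new(record, store)  # first time: stored
extractor.store_if_new(record, store)  # unchanged content: skipped
```

Because the rules live in data, supporting a new site in this sketch means supplying a different list of `FieldRule` entries, which mirrors the abstract's claim that templates can be reconfigured without changing code.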
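[Degree-granting institution]: University of Electronic Science and Technology of China (电子科技大学)
[Degree level]: Master's
[Year degree awarded]: 2009
[Classification number]: TP391.1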
Article ID: 1422856
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1422856.html