
Research on Bootstrapping-Based Automatic Extraction of Domain Knowledge

Published: 2018-03-19 00:01

  Topic: domain knowledge extraction. Focus: semi-structured websites. Source: Shandong University, master's thesis, 2012. Document type: degree thesis.


[Abstract]: With the rapid development of the Internet and the fast growth of Web applications, the volume of information on the Web has expanded dramatically. The Web has become an important knowledge base in daily life, and the need to obtain information efficiently is especially pressing. The massive data on the Web contains a large amount of semi-structured domain knowledge, such as movies, books and hotels, that is closely related to our daily lives. Although search engines can retrieve information from this data, their results are not very reliable. This domain knowledge usually comes from providers' back-end databases, and keyword-matching search engines, because of their inherent limitations, cannot index domain knowledge embedded in semi-structured HTML pages. How to automatically extract and organize this domain knowledge from large-scale websites has therefore become a hot topic in information extraction research. Web Information Extraction techniques can extract data from semi-structured web pages and store it in a database in structured form.

Building on an analysis of current Web information extraction techniques, this thesis uses the Tag Path Technique instead of the DOM tree to represent HTML documents. This representation greatly reduces the number of tags and improves the performance of the algorithm. For semi-structured websites, the thesis proposes a new Bootstrapping-based algorithm for automatically extracting domain knowledge: Domain-specific Knowledge Extraction from Websites (DKEW).

DKEW uses an ontology to uniformly annotate the semi-structured data extracted within a domain, which makes storage and querying easier. DKEW first clusters the target pages with a clustering algorithm based on the tag path technique and filters out noise pages, so that it extracts only from semi-structured pages that contain detailed information. Based on the tag path technique, a new pattern definition is proposed. For pages in the same cluster, patterns are learned automatically with the help of machine learning methods and domain seeds. The learned patterns are then used to extract domain knowledge automatically and match it to a predefined domain ontology, and the matched knowledge is stored in structured, easily queried knowledge base tables. During extraction, newly extracted domain knowledge with high confidence is used to expand the domain seeds and the ontology for the next iteration. Finally, Bootstrapping ties these extraction steps together into an automatic extraction tool that needs no manual supervision; DKEW requires only a small amount of human effort to initialize the domain seeds. To evaluate DKEW, the thesis uses a custom web crawler to collect web pages from multiple domains. Experiments show that DKEW not only outperforms the existing Web information extraction method RoadRunner in performance but is also far more efficient. Whereas RoadRunner requires the extracted data to be matched manually, DKEW performs ontology matching automatically, which saves considerable labor and time. Experiments across multiple domains show that DKEW can be applied to large-scale Web information extraction.
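
The abstract does not give the thesis's exact definition of a tag path, so the following is only a minimal sketch of the general idea, under stated assumptions: a page is reduced to the set of distinct root-to-node tag sequences, and two pages are compared by the Jaccard similarity of those sets. The names `TagPathCollector`, `tag_paths`, `jaccard` and the sample pages are hypothetical, and the Jaccard measure stands in for whatever clustering criterion DKEW actually uses.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Collects the distinct root-to-node tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # distinct paths such as "html/body/div"

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths for one page (assumed representation)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two path sets; 1.0 means identical structure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample pages: two detail pages and one listing page.
detail_a = ("<html><body><div><h1>Movie A</h1><table><tr>"
            "<td>Director</td><td>X</td></tr></table></div></body></html>")
detail_b = ("<html><body><div><h1>Movie B</h1><table><tr>"
            "<td>Year</td><td>2010</td></tr></table></div></body></html>")
listing = ("<html><body><ul><li><a href='/a'>A</a></li>"
           "<li><a href='/b'>B</a></li></ul></body></html>")

print(jaccard(tag_paths(detail_a), tag_paths(detail_b)))
print(jaccard(tag_paths(detail_a), tag_paths(listing)))
```

In this sketch the two detail pages share every tag path, while the listing page shares only the outermost ones, so a similarity threshold could separate detail pages from noise pages. Repeated sibling tags such as the two `td` cells collapse into a single path, which is one way a path-based representation can end up smaller than the full DOM tree.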
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP391.1





