
Research on Object Extraction in Domain-Specific Object-Level Vertical Search

Published: 2018-04-19 13:47

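  Topics: object-level search engine + Web information extraction; Source: University of Electronic Science and Technology of China, master's thesis, 2015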


[Abstract]: With the arrival of the information age, information sites of every kind have sprung up across the Internet, offering people a large amount of useful information. This has created a new challenge: how to let people quickly locate the information they need. Search engines emerged against this background, allowing users to find information quickly. They have evolved from the early semi-mechanical, semi-manual directory search to today's mainstream full-text and vertical search engines. Yet even full-text search, the most mature technology at present, remains somewhat weak at collecting web pages within a single domain, so precision and recall fall short of the ideal. Vertical search is better at collecting information within a single domain, but like full-text search it still provides page-level search services, and users have to filter the results again. Object-level vertical search therefore emerged as a new search mode: it provides object-level search within a specific domain, and the query results returned to the user are object entities that the search system forms through a series of extraction and integration steps. However, the object information extraction modules of existing object-level search engines are all semi-automatic: in the early stage, a great deal of manual effort is needed to annotate a portion of the web pages in order to obtain prior knowledge for object extraction. To address this, this thesis studies and improves the fully automatic Road Runner extraction algorithm, and designs and implements an automatic information extraction module for an object-level vertical search engine. The improvements fall into two areas: (1) The simple tree matching algorithm is improved, raising the accuracy of similarity judgments. The original simple tree matching algorithm treats all tag nodes in a page's DOM tree uniformly and does not account for the special nature of iteration tags; the improved version applies some processing to iteration tags before matching and comparison. (2) The attribute labeling module of the Road Runner algorithm is improved: associations between the objects extracted by different wrappers are used for cross-labeling, raising the attribute labeling rate of the extracted data. Road Runner's own attribute labeling technique relies on attribute values and attribute names appearing in pairs in the page content, whereas on most web pages some attribute names are missing. Finally, the improved algorithms are used to implement an object information extraction system, which is tested on extraction in the book domain.
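The abstract does not say how iteration tags are processed before matching, so the following is only a minimal sketch under stated assumptions. It shows the classic simple tree matching recurrence and one possible preprocessing step, collapsing adjacent sibling subtrees that are structurally identical, to illustrate why repeated list items can distort a similarity score. The names `Node`, `collapse_repeats`, and `similarity`, and the normalization by average tree size, are hypothetical choices for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DOM-like tag node (hypothetical minimal structure)."""
    tag: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def signature(node):
    """Structural fingerprint: the tag plus the children's fingerprints."""
    return (node.tag, tuple(signature(c) for c in node.children))


def collapse_repeats(node):
    """Assumed preprocessing: keep one copy of each run of adjacent,
    structurally identical sibling subtrees (e.g. repeated <li> rows)."""
    kept = []
    for child in node.children:
        collapsed = collapse_repeats(child)
        if not kept or signature(kept[-1]) != signature(collapsed):
            kept.append(collapsed)
    return Node(node.tag, kept)


def simple_tree_matching(a, b):
    """Classic simple tree matching: size of the maximum order-preserving
    matching between two trees whose roots must carry the same tag."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j],
                              table[i][j - 1],
                              table[i - 1][j - 1] + w)
    return table[m][n] + 1  # +1 counts the matched roots


def similarity(a, b):
    """Matched nodes divided by the average size of the two trees."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def listing_page(rows):
    """Build body > ul > (li > a) repeated `rows` times."""
    items = [Node("li", [Node("a")]) for _ in range(rows)]
    return Node("body", [Node("ul", items)])


page_a, page_b = listing_page(2), listing_page(5)
raw = similarity(page_a, page_b)
collapsed = similarity(collapse_repeats(page_a), collapse_repeats(page_b))
print(f"raw={raw:.3f} collapsed={collapsed:.3f}")
```

In this toy case the two pages share one template and differ only in the number of repeated rows. Uniform matching penalizes the extra rows, while collapsing the repeats first makes the shared template visible. A real implementation would need a looser notion of "repeated" than exact structural equality.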
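[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3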


Article ID: 1773341


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1773341.html

