Research on Web Page Information Extraction Technology in Vertical Search Engines

Published: 2018-02-17 02:41

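Keywords: vertical search engine; Web object information extraction; VIPS; block importance; 2D CRFs; HCRFs　Source: Jiangnan University, master's thesis, 2012　Document type: degree thesis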

[Abstract]: With the rapid growth of the Internet, the volume of information online has exploded, and the limitations of general-purpose search engines have become increasingly apparent. Vertical search engines have emerged in recent years to locate the information people want more quickly and accurately. A vertical search engine targets one specific domain and returns more fine-grained results than a general-purpose engine, so it must extract domain-relevant information from web pages. This thesis studies web page information extraction for vertical search engines and covers the following aspects:
(1) Web page analysis based on visual features. Building on a study of the Vision-based Page Segmentation method (VIPS), a prototype of the VIPS algorithm is implemented and used to segment the Web pages targeted for extraction, which prepares the data for the subsequent extraction steps.
(2) Web object information extraction based on block importance and 2D CRFs. For the Web object extraction workflow, a method combining block importance with 2D CRFs is proposed. First, a block importance model (BIM) scores the page blocks produced by visual segmentation and locates the target block that contains the object information. Next, a 2D CRFs model is built over the two-dimensional structure of the target block to extract the object information. Finally, comparative experiments verify the feasibility of the method.
(3) Web object information extraction based on improved HCRFs. HCRFs are a statistical model applicable to Web object extraction, but they do not fully describe the conditional dependencies among Web object elements. This thesis proposes LL-HCRFs, an improved hierarchical conditional random field model, together with a method for adding long-distance dependencies between object elements, and adapts the original parameter estimation algorithm to the newly added dependencies. Comparative experiments against Linear-CRFs and HCRFs show that the improved model performs well on Web object extraction.
(4) The "搜食计" vertical search engine. The final part of the thesis designs and implements "搜食计", a prototype vertical search engine for the food and dining domain, and describes each of its functional modules in detail.
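As a rough illustration of the linear-chain CRF baseline that the abstract compares against, the sketch below decodes the best label sequence for the elements of one page block with the Viterbi algorithm. It is not the thesis's 2D CRFs or LL-HCRFs model. The function name `viterbi_decode`, the label set, and all scores are hypothetical values chosen for the example, and a real system would learn the scores from training data.

```python
from typing import Dict, List, Sequence, Tuple

def viterbi_decode(
    emission: Sequence[Dict[str, float]],
    transition: Dict[Tuple[str, str], float],
    labels: Sequence[str],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emission[i][y]     : score of giving element i the label y
    transition[(a, b)] : score of label a being followed by label b
    Scores are unnormalized log-potentials; missing entries count as 0.
    """
    if not emission:
        return []
    # best[i][y] = best score of any labeling of elements 0..i that ends in y
    best = [{y: emission[0].get(y, 0.0) for y in labels}]
    back = []  # back[i-1][y] = best previous label when element i gets label y
    for i in range(1, len(emission)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transition.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev]
                + transition.get((prev, y), 0.0)
                + emission[i].get(y, 0.0)
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label to the start.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical restaurant-page block with three text elements.
LABELS = ["name", "address", "price"]
example_emission = [
    {"name": 2.0, "address": 0.5, "price": 0.1},
    {"name": 1.0, "address": 0.9, "price": 0.2},  # ambiguous on its own
    {"name": 0.1, "address": 0.3, "price": 2.0},
]
example_transition = {
    ("name", "address"): 1.0,
    ("address", "price"): 1.0,
    ("name", "name"): -1.0,
}

print(viterbi_decode(example_emission, example_transition, LABELS))
# -> ['name', 'address', 'price']
```

Labeling each element independently would tag the second element as "name". The transition scores are what tip it to "address", and this is the kind of dependency between neighboring elements that CRF-based extraction relies on.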
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3

[References]