Research on Web-Based Spatial Data Crawling and Measurement

Published: 2018-04-28 04:36

Topic: spatially sensitive crawler + spatial data crawling; Source: Wuhan University, 2013 doctoral dissertation

[Abstract]: The rapid development of Web technology has given people a wealth of information and, with it, a great deal of redundancy. Quickly locating what a user needs is a common problem in web retrieval today. This is especially true in the spatial information domain: spatial data carries both geometric and attribute information, and on the web these can only be expressed separately, as textual description and as geometric graphics. Current spatial information retrieval concentrates on matching textual descriptions; retrieval based on spatial geometry has received comparatively little study.
After analyzing the problems of spatial information retrieval in the current web environment, this dissertation reviews the main research areas involved in solving them and the progress made in those areas in China and abroad. Starting from web crawling, it discusses the main characteristics and classification of spatial information on the web, examines methods for parsing and recognizing different types of spatial data, and sets out basic methods for measuring data confidence for different data types and their corresponding pages. It also extends the spatial data classification system by proposing a classification label system for crawled spatial data, which underpins storage, management, and later use of the data. Finally, an example model is used to verify parts of the spatial data crawling process, with a corresponding quality evaluation and analysis.
For different spatial data types, the dissertation examines in depth the basic principles and methods of crawling data with a spatially sensitive crawler. It first introduces the concept of the spatially sensitive crawler, describes its workflow and how it resembles and differs from a traditional crawler, and covers the similarity measure between the spatial information of spatially sensitive pages and page links on the one hand and spatial search terms on the other. It then focuses on discovery mechanisms for different types of spatial data: spatial data services, raster data, vector data, and other data. For each type it discusses how the data appears in web pages and the basic parsing process, with explanations of the main algorithms and models involved.
The dissertation proposes a confidence measure for Web spatial data. Because descriptive information is often lacking, the quality of Web spatial data is hard to measure accurately, which makes later retrieval and use difficult. Drawing on basic methods from spatial data quality, and considering both the textual description of the data and the information in the data itself, it proposes a method for qualitatively measuring vector and raster data. It then analyzes and compares the confidence of different spatial data types; among the different resources linked to the same spatially sensitive page, the larger confidence is taken as the best match for the whole page.
Combining a metadata model with current spatial data classification systems, the dissertation proposes the idea of classification labels for Web spatial data. Because Web spatial data varies in scale of representation, extent, features, and so on, traditional classification systems are hard to apply to it, and a new way of recording its descriptive information is needed. Drawing on the metadata model and on classification systems tied to data applications, a classification label system model is proposed. On this basis, the storage and management of Web data after acquisition, and its later retrieval and use, are briefly described.
Through an example model, the spatially sensitive crawler is verified end to end, from page filtering to information extraction to a basic evaluation of quality. The dissertation analyzes and summarizes the inconsistencies between the theory and practice, which shows the complexity of crawling spatial data from the web and lays some theoretical and practical groundwork for follow-up research.
Finally, the dissertation summarizes each stage of the overall spatial information crawling workflow and proposes several directions for further research.
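
The abstract does not give the dissertation's algorithms, so the following is only a minimal sketch of two ideas it mentions: sorting links into spatial data services, raster, vector, and other data, and keeping the largest-confidence resource among those linked to one page. The extension lists, the `service` query-parameter heuristic, and the names `classify_link`, `Resource`, and `best_resource` are all assumptions made for illustration, and the confidence values are taken as given rather than computed.

```python
from dataclasses import dataclass
from urllib.parse import urlparse, parse_qs
import posixpath

# Hypothetical lookup sets, chosen only for illustration.
VECTOR_EXTS = {".shp", ".kml", ".gml", ".geojson"}
RASTER_EXTS = {".tif", ".tiff", ".img"}
SERVICE_NAMES = {"WMS", "WFS", "WCS"}


def classify_link(url: str) -> str:
    """Guess a link's category: 'service', 'raster', 'vector', or 'other'."""
    parsed = urlparse(url)
    # OGC-style requests usually carry a `service` query parameter;
    # lower-case the keys so SERVICE= and service= are treated alike.
    query = {k.lower(): v for k, v in parse_qs(parsed.query).items()}
    service = query.get("service", [""])[0].upper()
    if service in SERVICE_NAMES:
        return "service"
    # Otherwise fall back to the file extension of the URL path.
    ext = posixpath.splitext(parsed.path)[1].lower()
    if ext in RASTER_EXTS:
        return "raster"
    if ext in VECTOR_EXTS:
        return "vector"
    return "other"


@dataclass
class Resource:
    url: str
    confidence: float  # assumed precomputed; the scoring method is not specified here


def best_resource(resources):
    """Return the resource with the largest confidence, or None if there are none."""
    return max(resources, key=lambda r: r.confidence, default=None)
```

A crawler built along these lines would call `classify_link` on each outgoing link of a candidate page, score the resulting resources with whatever confidence measure it uses, and then pass them to `best_resource` to pick the page's representative match.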

[Degree-granting institution]: Wuhan University
[Degree level]: Doctoral
[Year degree granted]: 2013
[Classification number]: P208
