Research and Application of Internet Geographic Information Crawler Technology

Published: 2018-03-21 20:35

Topic: geographic information　Focus: crawler technology　Source: Shandong Agricultural University, 2017 master's thesis　Document type: degree thesis

[Abstract]: Traditional geographic information data collection usually relies on national geographic information surveys, field surveys, and similar methods. However, as society develops and features such as residential areas and roads keep changing, the drawbacks of this approach (high data cost, heavy workload, low efficiency, and poor timeliness) have become increasingly prominent. As the Internet grows, the geographic data interwoven across it increases daily, and a wealth of knowledge is hidden in that data. Crawling relevant geographic data from the Internet has therefore become a new channel for sourcing geographic information. Crawler technology has, to some extent, solved the problem of acquiring Web data, but ordinary general-purpose crawlers struggle to crawl the geographic information on the Internet effectively. Building on a summary of general-purpose crawler techniques, Internet geographic information crawling does not pursue broad coverage. It instead targets Web data related to Internet geographic information content, which makes crawling more focused and addresses the high data cost, heavy workload, low efficiency, and poor timeliness of geographic information collection. The main work of this thesis is as follows:
(1) Analysis and summary of the characteristics of websites that carry Internet geographic information. Based on how browsers work, and by analyzing how these websites exchange and present information, shallow-web sites carrying geographic information are divided, from the perspective of crawler data collection, into three main types: M-Dom, M-Render, and M-Trigger. Deep-web sites carrying geographic information are analyzed through concrete experiments, with emphasis on the characteristics of sites that carry deep-web POI geographic information.
(2) Research on techniques for acquiring Internet geographic information. For shallow-web collection scenarios, the thesis focuses on crawling methods for single pages and list pages. For deep-web POI collection scenarios, it summarizes the collection difficulties and techniques, designs two sets of content search terms, and studies the associated crawling strategies.
(3) Technical verification and prototype system development. Based on the methods, techniques, and strategies studied, a prototype system for collecting Internet geographic information is designed. The design is described in terms of system architecture, functions, modules, and core logic, and the prototype is implemented and validated in application.
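The abstract describes a crawler that gives up broad coverage in favor of pages related to geographic information. The sketch below illustrates that general idea as a small focused-crawl loop. It is not the thesis's method. The keyword list `GEO_KEYWORDS`, the scoring function `relevance`, the `threshold` value, and the in-memory `PAGES` graph (which stands in for real HTTP fetches) are all assumptions made for illustration.

```python
import heapq
import re

# Hypothetical keyword list; the thesis does not publish its relevance terms.
GEO_KEYWORDS = ("address", "latitude", "longitude", "road", "district", "poi")


def relevance(text: str) -> float:
    """Return the fraction of word tokens in `text` that are geographic keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in GEO_KEYWORDS)
    return hits / len(tokens)


def focused_crawl(pages, seed, threshold=0.1, limit=10):
    """Visit pages best-first, skipping pages that score below `threshold`.

    `pages` maps URL -> (page text, outgoing links) and replaces network access.
    """
    frontier = [(-1.0, seed)]  # min-heap on negated score, so best score pops first
    seen = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        text, links = pages.get(url, ("", []))
        score = relevance(text)
        if url != seed and score < threshold:
            continue  # off-topic page: neither recorded nor expanded
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that points to it.
                heapq.heappush(frontier, (-score, link))
    return visited


# Toy link graph used only to demonstrate the loop.
PAGES = {
    "seed": ("city portal", ["geo", "news"]),
    "geo": ("road address latitude longitude", ["poi"]),
    "news": ("sports results today", ["gossip"]),
    "poi": ("poi address district", []),
    "gossip": ("celebrity gossip", []),
}

print(focused_crawl(PAGES, "seed"))
```

In this toy graph the off-topic `news` page is fetched but discarded, so its `gossip` link is never followed. That pruning is the trade of coverage for focus that the abstract refers to. A real crawler would replace the keyword ratio with a stronger relevance model and would fetch pages over HTTP.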

[Degree-granting institution]: Shandong Agricultural University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.3;P208

[References]