Design and Implementation of a Web Crawler for Acquiring Deep Web Data

Published: 2018-04-14 19:27

Topic: web crawler + Deep; Source: Central China Normal University, master's thesis, 2013

[Abstract]: In today's information age, the information on the Internet keeps growing rapidly. Storing data is easy, but finding useful information in it is increasingly difficult. General-purpose search engines emerged as a solution to this problem. Part of the data on the Web, however, is stored in the databases of individual websites. This data cannot be reached directly through the links on web pages; a user has to fill in a site's query form and submit a query to access it. Such data is called Deep Web data. Compared with the information in static web pages, Deep Web data is more specialized, larger in volume, and more valuable to users. General-purpose search engines cannot reach Deep Web data when they crawl the Web, which limits the valuable information their users can obtain.
The E-era Xinhai Revolution search engine (E时代辛亥革命搜索引擎) is a vertical search engine that provides retrieval services for researchers of the Xinhai Revolution (the 1911 Revolution), and its web crawler subsystem is one of the key subsystems the engine has to build. Building on a general-purpose search engine, this thesis analyzes the structural characteristics of Deep Web data and proposes an implementation scheme for detecting and acquiring Deep Web data sources. The scheme addresses two main problems:
1. The node features of Deep Web query interfaces are analyzed and a node feature library is built. When the crawler fetches a new page, it compares the page's node features against the library to estimate how likely the page is to contain a Deep Web data source. The crawler thus discovers Deep Web data sources automatically while crawling and records the relevant information to a file.
2. The crawler reads the Deep Web file, assembles query requests for each Deep Web data source, and fetches the site's responses. By computing page similarity, it finds a "similar page" for each query result page. By analyzing the structural features of the result page and its similar page, it extracts the query result links and pagination links from the result page and discards navigation links, advertising links, and the like.
Research and experiments show that the detection and acquisition model finds the query interfaces of site pages fairly well and extracts Deep Web query results fairly accurately.
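The first step, matching a page's form nodes against a feature library, can be illustrated with a minimal Python sketch. The abstract does not list the thesis's actual features, so the feature names, weights, hint words, threshold, and function names below are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical feature library: names and weights are illustrative only.
FEATURE_WEIGHTS = {"text_input": 2.0, "select": 1.0, "submit": 1.0,
                   "search_hint": 2.0, "password": -5.0}
SEARCH_HINTS = ("search", "query", "keyword", "查询", "检索", "搜索")


class FormCollector(HTMLParser):
    """Collect every <form> on a page together with its node features."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._form = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "form":
            self._form = {"action": a.get("action", ""), "text_fields": [],
                          "features": [], "hints": [a.get("action", "")]}
        elif self._form is not None and tag in ("input", "select"):
            kind = "select" if tag == "select" else (a.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._form["features"].append("text_input")
                if a.get("name"):
                    self._form["text_fields"].append(a["name"])
            elif kind in ("select", "submit", "password"):
                self._form["features"].append(kind)
            # Field names and ids are kept as text hints for keyword matching.
            self._form["hints"] += [a.get("name", ""), a.get("id", "")]

    def handle_endtag(self, tag):
        if tag == "form" and self._form is not None:
            self.forms.append(self._form)
            self._form = None


def score_form(form):
    """Sum the weights of the matched features; higher means more query-like."""
    score = sum(FEATURE_WEIGHTS[f] for f in form["features"])
    hint_text = " ".join(form["hints"]).lower()
    if any(h in hint_text for h in SEARCH_HINTS):
        score += FEATURE_WEIGHTS["search_hint"]
    return score


def find_query_interfaces(html, threshold=4.0):
    """Return the forms whose score reaches the (assumed) threshold."""
    parser = FormCollector()
    parser.feed(html)
    return [f for f in parser.forms if score_form(f) >= threshold]


def build_query(form, keyword):
    """Assemble a GET query filling the first named text field with the keyword.

    Assumes the form has at least one named text field.
    """
    return form["action"] + "?" + urlencode({form["text_fields"][0]: keyword})
```

In this sketch a login form scores low because the password field carries a negative weight, while a form with a text box, a submit button, and a search-like action passes the threshold. A crawler could write each detected form to a file and later call `build_query` to assemble requests.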
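The second step, separating result and pagination links from navigation and advertising links, can be sketched in a similar way. The abstract does not specify the thesis's similarity measure or structural analysis, so this example substitutes a simple stand-in: Jaccard similarity over (tag, class) pairs to pick the similar page, and a set difference of link targets to drop links shared by both pages. The class name, function names, and pagination parameter names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

PAGE_PARAMS = {"page", "p", "pn"}  # assumed pagination parameter names


class PageProfile(HTMLParser):
    """Record a page's links and a coarse structural fingerprint."""

    def __init__(self, html):
        super().__init__()
        self.links = []     # hrefs in document order
        self.shape = set()  # (tag, class) pairs seen on the page
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        self.shape.add((tag, a.get("class", "")))
        if tag == "a" and a.get("href"):
            self.links.append(a["href"])


def similarity(p, q):
    """Jaccard similarity of the two pages' (tag, class) sets."""
    union = p.shape | q.shape
    return len(p.shape & q.shape) / len(union) if union else 0.0


def most_similar(page, candidates):
    """Pick the candidate page whose structure is closest to `page`."""
    return max(candidates, key=lambda c: similarity(page, c))


def classify_links(result_page, similar_page):
    """Split the result page's links into (result links, pagination links).

    Links that also occur on the similar page are treated as template
    links (navigation, ads) and discarded.
    """
    template = set(similar_page.links)
    results, paging = [], []
    for href in result_page.links:
        if href in template:
            continue
        params = parse_qs(urlparse(href).query)
        if PAGE_PARAMS.intersection(params):
            paging.append(href)
        else:
            results.append(href)
    return results, paging
```

A natural choice of similar page is the result page returned by the same site for a different query: the two pages share the site template but differ in their result and pagination links.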

[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
