Research and System Implementation of a Topic-Focused Web Crawler for the Inspection and Quarantine Domain
Published: 2018-04-17 13:56
Topics: web crawler + data retrieval; Source: Zhejiang University, master's thesis, 2017
[Abstract]: In recent years, the total volume of global information has grown dramatically, driven by the Internet. International Data Corporation (IDC) projects that by 2020 the global data volume will grow at an annual rate of 50% to reach 40 ZB, with unstructured information such as files, video, and audio accounting for 90% of total data production. Against this background, users' demands for precision and depth of information within this ocean of data keep rising; for specialized queries within professional domains in particular, the information collected by general-purpose search engines is heterogeneous and imprecise. In view of this, this thesis takes the text classification problem as its main research object. Starting from vertical search engines, it explores in depth the underlying technologies such as data collection and keyword search, and, supported by a real project, implements concrete data collection and search subsystems for the specific topic domain of inspection and quarantine. The main contributions of this thesis are as follows:
1. It surveys the key technologies used in implementing a crawler system, such as web-page de-noising, main-text extraction, deduplication of massive URL and document collections, and NoSQL databases. In addition, to handle the parsing and downloading of dynamic web content, it proposes a protocol-controlled JavaScript parsing strategy.
2. It enumerates and discusses crawling strategies based on network topology, page text, and user access behavior; after weighing their respective strengths and weaknesses, it proposes a crawling strategy based on URL density clustering, which partitions and fetches related pages by grouping them into clusters (an illustrative sketch follows the abstract).
3. Comparing the strengths and weaknesses of traditional text classifiers, it combines Word2vec word vectors with deep learning and proposes a hierarchical long short-term memory classification network with an attention mechanism for text classification, which extracts structured features at both the word and sentence levels to represent an entire document as a feature vector (a minimal architectural sketch is also given below).
4. As part of a subtopic of the "973 Program", it implements a data collection subsystem and a data search subsystem for the inspection and quarantine domain. The data collection, cleaning, storage, classification, and indexing services are deployed in a distributed environment of multiple servers, effectively improving computational performance and system stability.
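To make the URL density clustering idea in contribution 2 concrete, the sketch below groups URLs by the similarity of their character n-grams and clusters them with DBSCAN. This is only a minimal illustration under assumed choices: the thesis does not specify its features or clustering algorithm, and the n-gram vectorizer, the DBSCAN parameters, and the example.org URLs are all hypothetical.

```python
# Illustrative sketch of "URL density clustering" (not the thesis's actual
# code). Assumption: pages sharing a URL template such as /notice/2017/...
# are topically related, so character n-grams over the URL string plus a
# standard density clusterer (DBSCAN) can group them.
from collections import defaultdict
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_urls(urls, eps=0.7, min_samples=3):
    """Group URLs into density clusters; label -1 marks noise/outliers."""
    # Character n-grams capture shared path templates across similar pages.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    X = vec.fit_transform(urls)
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(X)
    clusters = defaultdict(list)
    for url, label in zip(urls, labels):
        clusters[label].append(url)
    return clusters

if __name__ == "__main__":
    seeds = [  # hypothetical URLs, purely for demonstration
        "http://example.org/quarantine/notice/2017/001.html",
        "http://example.org/quarantine/notice/2017/002.html",
        "http://example.org/quarantine/notice/2017/003.html",
        "http://example.org/about/contact.html",
    ]
    for label, group in cluster_urls(seeds, min_samples=2).items():
        print(label, group)
```

A crawler built on this idea could prioritize fetching from dense clusters first, on the assumption that template-sharing pages within one cluster are the topically relevant ones.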
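Similarly, the hierarchical attention classifier of contribution 3 can be sketched in a few lines of PyTorch: a bidirectional LSTM with attention pools words into sentence vectors, and a second LSTM with attention pools sentences into a document vector. The layer sizes, class count, and random embeddings below are placeholder assumptions; in the thesis the embeddings would be initialized from Word2vec.

```python
# Minimal sketch of a hierarchical attention LSTM classifier in the spirit
# of contribution 3 (not the thesis's implementation; all sizes assumed).
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Additive attention that pools a sequence of hidden states into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))

    def forward(self, h):                      # h: (batch, seq, dim)
        u = torch.tanh(self.proj(h))
        alpha = torch.softmax(u @ self.context, dim=1)      # attention weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1)         # (batch, dim)

class HierarchicalAttnClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=200, hidden=50, num_classes=5):
        super().__init__()
        # Random embeddings for self-containedness; Word2vec in the thesis.
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.word_rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.word_attn = Attention(2 * hidden)
        self.sent_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.sent_attn = Attention(2 * hidden)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, docs):                   # docs: (batch, sents, words) token ids
        b, s, w = docs.shape
        words = self.embed(docs.view(b * s, w))             # encode every sentence
        h, _ = self.word_rnn(words)
        sent_vecs = self.word_attn(h).view(b, s, -1)        # one vector per sentence
        h, _ = self.sent_rnn(sent_vecs)                     # encode the document
        return self.out(self.sent_attn(h))                  # class logits

if __name__ == "__main__":
    model = HierarchicalAttnClassifier(vocab_size=10000)
    fake_docs = torch.randint(1, 10000, (4, 6, 20))         # 4 docs, 6 sents, 20 words
    print(model(fake_docs).shape)                           # torch.Size([4, 5])
```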
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2017
[CLC classification]: TP391.1; TP393.092
Article ID: 1763880
Link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1763880.html