Design and Implementation of a Meta-Search Data Collection System for Quality and Safety

Published: 2019-05-11 07:53

[Abstract]: Quality and safety incidents occur frequently, and as Internet use has spread, they are increasingly discussed online. User comments and online media reports about quality and safety can both serve as text corpora for quality and safety analysis, so the Internet can act as a data source for this information and provide the data foundation for analysis. This thesis designs and implements a meta-search-based data collection system that gathers web pages related to quality and safety. Here the meta-search engine is not used in the traditional way; instead, it drives data collection from query terms set by the user. The system comprises three functional modules: meta-search querying, web page extraction, and relevance judgment. The meta-search module encapsulates the different meta-search engines and manages queries with priority scheduling. The web page extraction module combines two approaches: template-based parsing, which mainly extracts result links, and statistics-based parsing, which serves as a general-purpose method for extracting the main body text. The relevance judgment module uses a support vector machine (SVM) classifier to select quality-and-safety-related data and remove noise. Finally, the thesis evaluates web page extraction and classification performance and presents the results of running the system. Because quality-and-safety data are widely scattered across the Internet yet have distinctive features, the thesis forgoes a focused-crawler approach and instead explores using meta-search engines for data collection. The work may serve as a reference for data collection research in other domains.
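The abstract mentions priority-scheduled queries dispatched to several wrapped search engines but gives no implementation details. The following is only a minimal sketch of how such a scheduler might look; the class name `QueryScheduler`, the engine-wrapper signature, and the convention that a lower number means higher priority are all assumptions made for illustration, not the thesis's actual design.

```python
import heapq
import itertools
from typing import Callable, Dict, List, Tuple

# Hypothetical engine wrapper: takes a query string, returns result URLs.
EngineFn = Callable[[str], List[str]]


class QueryScheduler:
    """Dispatches queued queries to every wrapped engine in priority order."""

    def __init__(self, engines: Dict[str, EngineFn]) -> None:
        self._engines = engines
        self._heap: List[Tuple[int, int, str]] = []
        # Monotonic counter keeps insertion order among equal priorities.
        self._counter = itertools.count()

    def submit(self, query: str, priority: int) -> None:
        # Assumption: a lower number means the query is dispatched earlier.
        heapq.heappush(self._heap, (priority, next(self._counter), query))

    def run(self) -> List[Tuple[str, List[str]]]:
        """Pop queries by priority and merge de-duplicated links per query."""
        results = []
        while self._heap:
            _, _, query = heapq.heappop(self._heap)
            seen = set()
            merged = []
            for engine in self._engines.values():
                for url in engine(query):
                    if url not in seen:
                        seen.add(url)
                        merged.append(url)
            results.append((query, merged))
        return results
```

In a real system each engine wrapper would issue an HTTP request and extract result links with a per-engine template; here the wrappers are left as plain callables so the scheduling logic can be read on its own.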
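[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP274.2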

[References]