Design and Implementation of a Web Crawler for a Price-Comparison Shopping Platform

Published: 2018-12-31 08:10

[Abstract]: With the spread and development of information technology, the Internet has reached every corner of people's lives and work. Search engines have become the fastest tool for obtaining information, and online shopping has become a way of life that is accepted by more and more people. However, online goods come in a huge variety, prices differ widely, and merchants vary in quality, so consumers have to spend a great deal of time browsing the major shopping sites, comparing prices, and weighing value for money. Users therefore want a system that helps them choose products: one that holds information on the best-selling products of the major mainstream shopping sites, so that a simple search shows which site sells a product at the lowest price and with the best value for money. A price-comparison shopping platform is a good solution. For such a platform, how to obtain so large a volume of product data and price information is a critical problem. Against this background, this thesis proposes a solution for the platform's data source: the design and implementation of a web crawler. The work centers on how to design and implement the crawler's functions, extending and customizing certain features on top of the Heritrix web crawler. The following problems are discussed in depth: (1) determining the seed links: providing the crawler with an entry point for crawling; (2) the method of fetching web pages: saving pages that meet the requirements to a local folder; (3) analyzing and extracting web content: extracting the information in a page that relates to product attributes; (4) structuring and storing data: extracting product attributes one by one and storing them in a database; (5) presenting the product data for price comparison.
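Steps (3) to (5) of the abstract can be illustrated with a minimal sketch. It is not the thesis's Heritrix-based implementation, which the abstract does not show: it uses only the Python standard library, and the page markup (`class="name"`, `class="price"`), the site names, the `product` table, and all function names are assumptions made up for this example. The two HTML strings stand in for pages a crawler has already saved locally in step (2).

```python
import sqlite3
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects the text of elements whose class is 'name' or 'price'.

    The class names are hypothetical; real shop pages need per-site rules.
    """

    FIELDS = ("name", "price")

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self.record[self._field] = data.strip()
            self._field = None


def extract_product(html):
    """Step (3): pull the product attributes out of one saved page."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.record["name"], float(parser.record["price"])


def store(conn, site, name, price):
    """Step (4): write one structured record to the database."""
    conn.execute(
        "INSERT INTO product (site, name, price) VALUES (?, ?, ?)",
        (site, name, price),
    )


def cheapest(conn, name):
    """Step (5): return the (site, price) pair with the lowest price."""
    return conn.execute(
        "SELECT site, price FROM product WHERE name = ? "
        "ORDER BY price ASC LIMIT 1",
        (name,),
    ).fetchone()


# Stand-ins for pages saved to a local folder in step (2); made-up data.
SAVED_PAGES = {
    "shop-a.example": '<div><h1 class="name">USB cable</h1>'
                      '<span class="price">12.50</span></div>',
    "shop-b.example": '<div><h1 class="name">USB cable</h1>'
                      '<span class="price">9.90</span></div>',
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (site TEXT, name TEXT, price REAL)")
for site, html in SAVED_PAGES.items():
    name, price = extract_product(html)
    store(conn, site, name, price)

print(cheapest(conn, "USB cable"))
```

Keeping extraction, storage, and the comparison query as separate functions mirrors the separation of steps in the abstract, so a different parser or database could be swapped in without touching the other stages.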
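[Degree-granting institution]: East China University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3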

[Citing literature]