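Research on an Intelligent Web Advertisement Crawler System
Published: 2018-02-04 00:33
Keywords: Web advertising; crawling strategy; information extraction; page segmentation; clustering. Source: Harbin Institute of Technology, 2013 master's thesis. Document type: degree thesis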
[Abstract]: In recent years, as the Internet has come to shape people's daily lives ever more deeply, it has become an important advertising medium alongside television and newspapers. Because Web advertising offers wide coverage and strong interactivity, it has attracted many advertisers to market online. A very large volume of advertising data is placed on the Internet, and collecting it is worthwhile, yet no collector currently targets Web advertising data specifically. This thesis proposes and designs a Web advertisement crawler system dedicated to collecting advertising data from the Internet. The work covers three aspects:
(1) A crawling strategy for Web advertising information is designed. The strategy orders the download of URL seeds by computing a weight for each seed. Taking into account the types of advertising objects the crawler targets and the ways Web advertisements are placed, the thesis proposes a method for computing the weight of downloaded pages and a method for computing the weight of seed links: the weight of each downloaded page is computed first, and the weight of a seed link is then derived from it together with some global statistical knowledge.
(2) Based on observation and analysis of advertising data in a large number of Web pages of different types, an extraction method for Web advertising information is designed. Exploiting the locality and aggregation that advertising data exhibit within a page, the method uses a clustering algorithm to group all hyperlinks in the page into hyperlink blocks, then applies heuristic rules to determine the category of each block; if a block is judged to be an advertisement block, the advertising data in it are extracted.
(3) Building on the above results, an intelligent Web advertisement crawler system is designed and implemented. Starting from preset URL seeds, the system automatically downloads Web pages from the Internet and then extracts the advertising data they contain.
Experimental results show that, compared with breadth-first and depth-first strategies, the system's crawling strategy captures advertising data on the Internet more efficiently, and that the advertising information extraction algorithm extracts advertising data from Web pages accurately.
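The abstract gives neither the weight formulas nor the clustering algorithm, so the following is only a minimal Python sketch of the two ideas it describes: ordering downloads by link weight, and grouping hyperlinks into blocks before classifying each block. Every name here (WeightedFrontier, link_weight, cluster_links, is_ad_block, alpha, host_ad_rate, min_external_ratio) is hypothetical. The linear weight blend, the grouping by shared container path (a simple stand-in for the thesis's clustering step), and the external-link ratio rule are all assumptions made for illustration, not the author's method.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


class WeightedFrontier:
    """URL frontier that always yields the highest-weight URL first."""

    def __init__(self):
        self._heap = []     # entries are (-weight, insertion_order, url)
        self._seen = set()  # URLs already queued, so duplicates are skipped
        self._count = 0     # tie-breaker that preserves insertion order

    def push(self, url, weight):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the weight is negated to pop the largest first.
        heapq.heappush(self._heap, (-weight, self._count, url))
        self._count += 1

    def pop(self):
        neg_weight, _, url = heapq.heappop(self._heap)
        return url, -neg_weight

    def __len__(self):
        return len(self._heap)


def link_weight(parent_page_weight, host_ad_rate, alpha=0.5):
    """Assumed scoring: blend the weight of the page a link was found on
    with a global statistic (here, a per-host rate of ads seen so far)."""
    return alpha * parent_page_weight + (1 - alpha) * host_ad_rate


def cluster_links(links):
    """Group (container_path, href) pairs into blocks by container path.
    This stands in for the clustering step, which the abstract does not specify."""
    blocks = defaultdict(list)
    for container_path, href in links:
        blocks[container_path].append(href)
    return dict(blocks)


def is_ad_block(hrefs, page_host, min_external_ratio=0.8):
    """Assumed heuristic: treat a block as an ad block when most of its
    links point away from the host of the page being analysed."""
    if not hrefs:
        return False
    external = sum(1 for h in hrefs if urlparse(h).netloc not in ("", page_host))
    return external / len(hrefs) >= min_external_ratio
```

In a crawler loop, each downloaded page would be scored, its outgoing links pushed into the frontier with link_weight, and its hyperlinks passed through cluster_links and is_ad_block to decide which blocks to extract from. A real system would replace the container-path grouping with a proper clustering over link positions or DOM structure.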
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.09; TP391.1
Article ID: 1488791
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1488791.html