Research on Topic Crawler Technology Based on Web Page Segmentation

Published: 2018-02-22 19:46

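Keywords: web page segmentation; visual information; tag attributes; topic link block; Shark-Search algorithm    Source: Shandong Normal University, master's thesis, 2017    Document type: degree thesis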

[Abstract]: As Web information diversifies and its volume grows ever faster, storage costs rise and information collection becomes increasingly difficult. A general-purpose crawler consumes a large amount of network bandwidth while running, which wastes system resources. It also pays little attention to whether the pages it finds match the user's search topic, so it often returns many pages the user is not interested in. Vertical search engines built around topic crawlers (focused crawlers) emerged to improve crawling efficiency and the user experience. A topic crawler applies a heuristic search strategy while fetching pages: it computes the relevance between each page and the user's search topic, filters out the pages that are not relevant, and downloads only topic-relevant pages into the queue to be visited. Information on the Web is rich and varied; how to obtain and integrate topic content effectively, and how to use a crawler to download topic-relevant pages comprehensively and accurately, are the key technical challenges. Building on existing research results in topic crawling, this thesis studies web page segmentation and candidate link search strategies in depth. Under a block layout derived from tag information and visual information, it proposes a candidate link search algorithm that introduces a topic link block factor. The main work is as follows:

(1) Web page segmentation based on tag attributes and visual information. Segmentation exploits the layout regularities of table and div tags, combined with the visual information in CSS style sheets and style attributes. First, classification rules are formulated from web page design conventions, dividing content blocks into three classes: text blocks, link blocks, and irrelevant blocks. Topic text blocks are then extracted: tag attribute values provide a preliminary filter, and a similarity computation against a baseline block provides a further filter, yielding the final qualifying text. Topic link block extraction rules are used to match topic blocks, filter out noise links, and obtain the required topic link blocks. The chosen segmentation method is easy to implement in practice, avoids large-scale blind matching between blocks, and has low time and space complexity.

(2) During crawling, a topic crawler must first compute the weight of each link in the queue of links to be crawled and decide the visiting order by weight. On the basis of the Shark-Search algorithm, this thesis introduces the concept of topic link block weight and proposes an improved search strategy based on topic link blocks to predict the priority of the URLs in a page. The anchor texts of all sub-links in a link block serve as the main factor in computing link relevance, and the influence of link structure is also taken into account.

(3) To verify the effectiveness of the system, the HITS algorithm, the Shark-Search algorithm, and the proposed algorithm are first implemented under different thresholds and their results are compared. The experimental data show that the proposed system outperforms the other two algorithms under multiple threshold settings. Recall and total information volume under the three algorithms are then compared in detail, and the topic drift rate is analyzed experimentally for topics with clear semantics and for topics that are abstract concepts. The results show that the improved system performs better.
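The link-priority idea in (2) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything below is an assumption made for illustration: the bag-of-words cosine similarity, the `block_weight` definition (mean anchor-text similarity over all links in a block), the mixing parameters `delta`, `gamma`, and `lam`, and all function names. The Shark-Search part is simplified to an inherited score plus an anchor-text score and omits the anchor-context term of the original algorithm.

```python
import heapq
import math
import re
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw term counts."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def block_weight(topic: str, anchors: list[str]) -> float:
    """Hypothetical block weight: mean anchor similarity over the whole block."""
    if not anchors:
        return 0.0
    return sum(cosine(topic, a) for a in anchors) / len(anchors)


def link_score(topic, parent_text, anchor, block_anchors,
               inherited=0.0, delta=0.5, gamma=0.5, lam=0.3):
    """Simplified Shark-Search score blended with the assumed block weight."""
    parent_sim = cosine(topic, parent_text)
    # A relevant parent passes on a decayed share of its own relevance;
    # otherwise the child inherits a decayed share of the parent's inherited score.
    inherited_score = delta * parent_sim if parent_sim > 0 else delta * inherited
    shark = gamma * inherited_score + (1 - gamma) * cosine(topic, anchor)
    return (1 - lam) * shark + lam * block_weight(topic, block_anchors)


def rank_links(topic, parent_text, blocks):
    """Return URLs ordered by descending score.

    `blocks` is a list of link blocks; each block is a list of (anchor, url) pairs.
    """
    heap = []
    for block in blocks:
        anchors = [anchor for anchor, _ in block]
        for anchor, url in block:
            score = link_score(topic, parent_text, anchor, anchors)
            heapq.heappush(heap, (-score, url))  # negate: heapq is a min-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    blocks = [
        [("topic crawler tutorial", "u1"), ("web crawler design", "u2")],
        [("login", "u3"), ("contact us", "u4")],
    ]
    print(rank_links("web crawler", "focused web crawler survey", blocks))
```

In this sketch, links in a block whose anchors are mostly on-topic are promoted together, while links in a navigation-style block (such as "login" and "contact us") fall back to the inherited score alone. This is one plausible reading of how a block-level factor might reduce noise links, not the thesis's actual weighting.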
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092;TP391.3
