Research on a Topic-Based Weibo Web Crawler

Published: 2018-04-29 01:33

Topics: web page analysis + Weibo crawler; Source: Wuhan University of Technology, master's thesis, 2014

【Abstract】: Following the popularity of Twitter in the United States, major microblogging (Weibo) sites have sprung up in China, and Weibo has grown increasingly popular among Internet users. Buzzwords born on Weibo spread quickly across the web, a "Weibo effect" is gradually taking shape, and microblogging has become one of the main online activities of Chinese Internet users. Because of this effect, Weibo topics propagate rapidly among users, and the acquisition and analysis of Weibo information has become an important research subject. To make Weibo data easier to obtain, the major Weibo platforms provide APIs for fetching posts, but these APIs limit the number of calls and therefore cannot meet the need to collect large volumes of Weibo data; the data they return is also often messy. To address these problems, this thesis introduces web page analysis and topic relevance analysis, and studies and designs a topic-based Weibo web crawler. The main work is as follows: studying web page analysis techniques and selecting a method for extracting information from Weibo pages according to their characteristics; describing in detail the reasoning behind, and the design of, a breadth-first search strategy based on "pruning", with emphasis on URL de-duplication and the dynamically changing URL set; studying short-text topic extraction and multi-keyword matching to determine the design of Weibo topic relevance analysis; and finally designing and implementing a prototype topic-based Weibo web crawler that fetches and stores Weibo data in real time. The core problem of the thesis is to design a "pruning"-based breadth-first search strategy suited to the characteristics of Weibo data and apply it to a Weibo crawler, while using Weibo page analysis so that the crawler is not subject to the platform's API limits and users can fetch topic-related Weibo data as accurately as possible. Results from the prototype system were obtained through repeated experiments and compared with the crawling results of an API-based Weibo crawler and a web-page-based Weibo crawler. The comparison leads to the conclusion that the proposed crawling strategy can fetch topic-related Weibo data: although efficiency is somewhat lower, the fetched data shows good topic relevance. These experimental results indicate that the scheme studied in this thesis is feasible.
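The "pruning" breadth-first strategy, URL de-duplication, and multi-keyword matching named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the actual pruning rule or relevance measure, so everything here is an assumption made for illustration: relevance is taken to be the fraction of topic keywords found in a page, a page scoring below a threshold is not expanded, and `fetch`, `PAGES`, `relevance`, and `pruned_bfs` are hypothetical names operating on an in-memory toy graph rather than on real Weibo pages.

```python
from collections import deque


def relevance(text, keywords):
    """Fraction of topic keywords that occur in the text (assumed measure)."""
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw in text)
    return hits / len(keywords)


def pruned_bfs(start_url, fetch, keywords, threshold=0.5, max_pages=100):
    """Breadth-first crawl that does not expand pages scoring below threshold.

    fetch(url) is a hypothetical callable returning (text, outgoing_urls).
    Returns the URLs judged topic-relevant, in the order they were visited.
    """
    seen = {start_url}           # URL de-duplication: each URL is queued once
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    matched = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        visited += 1
        if relevance(text, keywords) >= threshold:
            matched.append(url)
        elif url != start_url:
            continue             # prune: skip the links of an off-topic page
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return matched


# Toy in-memory "web" standing in for fetched pages (illustrative data only).
PAGES = {
    "u/a": ("crawler design notes", ["u/b", "u/c"]),
    "u/b": ("weibo crawler topic analysis", ["u/d", "u/a"]),
    "u/c": ("lunch photos", ["u/e"]),
    "u/d": ("topic crawler for weibo", []),
    "u/e": ("weibo crawler tips", []),
}


def fetch(url):
    return PAGES[url]


result = pruned_bfs("u/a", fetch, ["weibo", "crawler"], threshold=1.0)
print(result)
```

In this toy run the seed page is always expanded, the off-topic page `u/c` is pruned, and `u/e` is therefore never visited even though it is on topic. That is one way to picture the trade-off the abstract reports between coverage or efficiency and topic relevance.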
【Degree-granting institution】: Wuhan University of Technology
【Degree level】: Master's
【Year degree conferred】: 2014
【Classification number】: TP393.092
