Research on a Topic-Based Weibo Web Crawler
Published: 2018-04-29 01:33
Topics: web page analysis + Weibo crawler; Source: Wuhan University of Technology, master's thesis, 2014
[Abstract]: With the popularity of Twitter in the United States, major microblogging (Weibo) sites have sprung up in China, and Weibo has become increasingly popular among Internet users. Buzzwords born on Weibo spread quickly across the web, a "Weibo effect" is gradually taking shape, and microblogging has become one of the main online activities of Chinese Internet users. Because of this effect, Weibo topics propagate rapidly among users, and acquiring and analyzing Weibo information has become an important research subject. To make Weibo data easier to obtain, the major Weibo sites provide APIs for fetching posts, but these APIs limit the number of calls and therefore cannot meet the need to collect large volumes of Weibo data; the data they return is also often cluttered. To address these problems, this thesis introduces web page analysis and topic relevance analysis techniques and carries out the research and design of a topic-based Weibo web crawler.

The main work is as follows. The thesis studies web page analysis techniques and selects a method for extracting information from Weibo pages according to the characteristics of those pages. It describes in detail the reasoning behind, and the design of, a breadth-first search strategy based on "pruning", focusing on URL deduplication and on the dynamically changing set of URLs. It studies short-text topic extraction and multi-keyword matching techniques and settles on a design for Weibo topic relevance analysis. Finally, it designs and implements a prototype topic-based Weibo web crawler that fetches and stores Weibo data in real time. The core problem is to design a "pruning"-based breadth-first search strategy suited to the characteristics of Weibo data and apply it to a Weibo crawler, while using Weibo page analysis so that the crawler is not subject to the Weibo platform's API limits, allowing users to fetch topic-relevant Weibo data as accurately as possible.

Results for the prototype system were obtained through repeated experiments and compared with the crawling performance of an API-based Weibo crawler and a web-page-based Weibo crawler. The conclusion is that the proposed crawling strategy can fetch topic-relevant Weibo data: although efficiency is somewhat lower, the fetched data shows good topic relevance. These results show that the approach studied in this thesis is feasible.
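The abstract does not give the actual pruning rule or relevance measure, so the following is only a minimal sketch of how a breadth-first crawl with pruning, URL deduplication, and multi-keyword matching could fit together. All names here (normalize, relevance, crawl, the fetch callback, the threshold of 0.5, and the toy PAGES graph) are assumptions made for illustration and are not taken from the thesis. The sketch runs on an in-memory graph rather than on real Weibo pages.

```python
from collections import deque
from urllib.parse import urldefrag


def normalize(url):
    """Canonical form used for deduplication: no fragment, no trailing slash."""
    return urldefrag(url).url.rstrip("/")


def relevance(text, keywords):
    """Fraction of topic keywords found in the text (case-insensitive).

    A stand-in for the thesis's topic relevance analysis, which is not specified.
    """
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords) if keywords else 0.0


def crawl(seed, fetch, keywords, threshold=0.5, max_pages=100):
    """Breadth-first crawl that prunes off-topic pages.

    fetch(url) is a hypothetical callback returning (page_text, outgoing_links).
    Pages scoring below the threshold are not stored and their links are not
    followed; every URL is enqueued at most once.
    """
    start = normalize(seed)
    seen = {start}             # dedup set: every URL ever enqueued
    frontier = deque([start])  # FIFO queue gives breadth-first order
    collected = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        text, links = fetch(url)
        if relevance(text, keywords) < threshold:
            continue  # prune: skip this page and do not expand its links
        collected.append(url)
        for link in links:
            key = normalize(link)
            if key not in seen:  # the frontier grows while the crawl runs
                seen.add(key)
                frontier.append(key)
    return collected


# Toy link graph standing in for fetched pages (hypothetical data).
BASE = "https://example.com/u/"
PAGES = {
    BASE + "1": ("crawler topic pruning", [BASE + "2", BASE + "3", BASE + "1#top"]),
    BASE + "2": ("lunch photos", [BASE + "4"]),
    BASE + "3": ("web crawler notes", [BASE + "1/", BASE + "5"]),
    BASE + "4": ("crawler", []),
    BASE + "5": ("topic crawler", []),
}

if __name__ == "__main__":
    print(crawl(BASE + "1", PAGES.__getitem__, ["crawler", "topic"]))
```

In this toy graph, page 2 is off-topic, so it is pruned and page 4 (reachable only through page 2) is never visited. That illustrates the trade-off the abstract reports: pruning improves topic relevance but can miss relevant pages behind off-topic ones.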
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092
Article ID: 1817814
Article link: https://www.wllwen.com/guanlilunwen/ydhl/1817814.html