Research and Implementation of a Nutch-Based Web Crawler System Supporting Ajax Dynamic Web Pages
Published: 2018-10-22 12:28
[Abstract]: With the rapid development of Web 2.0, websites use Ajax more and more widely. Ajax refreshes parts of a page through asynchronous calls, which greatly improves the user experience, reduces network traffic, and speeds up site access. While Ajax has transformed the interaction model of the Internet, it has also created a series of problems for users and developers, such as non-standard use and coding of JavaScript, browser incompatibilities, excessive numbers of page requests, and heavy server load caused by overuse of Ajax. A crawler is an essential data-collection subsystem of a search engine: the search engine builds an index from the data the crawler collects and then provides search services to users. The widespread use of Ajax therefore has a significant impact on search engines. Traditional search engines only provide search over the data in static pages and cannot search the dynamic data generated by Ajax. As Ajax use grows, the volume of dynamically generated page data grows with it, and this data is valuable for data analysis, data mining, and related work. For example, some of the comments on Sina News are generated dynamically through Ajax, and collecting this data is important for national security. This thesis improves Nutch by adding several modules to build a web crawler system that can crawl Ajax dynamic data, builds an index from the collected data, and provides search services to users.
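The gap the abstract describes, where a crawler that only reads the fetched HTML misses content loaded later by Ajax, can be illustrated with a minimal Python sketch. This is not the thesis's Nutch module: the marker list, the `PageScanner` class, the `scan` function, and the sample page are all hypothetical, and the check is only a heuristic for flagging pages that would need script execution before indexing.

```python
from html.parser import HTMLParser
import re

# Hypothetical markers of client-side data loading; not taken from the thesis.
AJAX_MARKERS = re.compile(
    r"XMLHttpRequest|\$\.(?:ajax|get|post|getJSON)\s*\(|\bfetch\s*\("
)


class PageScanner(HTMLParser):
    """Collects visible text and inline script source separately."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.text_parts = []
        self.script_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Route each non-empty chunk to script source or visible text.
        target = self.script_parts if self._in_script else self.text_parts
        if data.strip():
            target.append(data.strip())


def scan(html):
    """Return (static_text, needs_rendering) for one fetched page."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    static_text = " ".join(scanner.text_parts)
    script_source = "\n".join(scanner.script_parts)
    needs_rendering = bool(AJAX_MARKERS.search(script_source))
    return static_text, needs_rendering


# Hypothetical page: the comment container is empty until the script runs.
SAMPLE = """
<html><body>
<h1>News story</h1>
<div id="comments"></div>
<script>
var xhr = new XMLHttpRequest();
xhr.open("GET", "/comments?id=1");
xhr.onload = function () {
  document.getElementById("comments").innerHTML = xhr.responseText;
};
xhr.send();
</script>
</body></html>
"""

static_text, needs_rendering = scan(SAMPLE)
print(static_text)       # text a static-only crawler would index
print(needs_rendering)   # whether the page appears to load data via Ajax
```

In the sample, the static text contains only the headline, because the comments exist only after the script has run. A crawler aimed at Ajax data would have to execute that script, or replay its request, before indexing the page.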
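[Degree-granting institution]: Inner Mongolia Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
Article ID: 2287161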
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2287161.html