Research and Implementation of a Nutch-Based Web Crawler System Supporting Ajax Dynamic Web Pages

Published: 2018-10-22 12:28

[Abstract]: With the rapid development of Web 2.0, websites rely on Ajax more and more. By making asynchronous calls and refreshing only part of a page, Ajax greatly improves the user experience, reduces network traffic, and speeds up site access. While Ajax has transformed how users interact with the Web, it has also created a series of problems for users and developers, such as non-standard use and writing of JavaScript code, browser incompatibilities, excessive numbers of page requests, and heavy server load caused by overuse of Ajax. A crawler is an essential data-collection subsystem of a search engine: the search engine builds an index from the data the crawler collects and then provides search services to users. The widespread use of Ajax therefore has a significant impact on search engines. Traditional search engines provide search only over static page data and cannot search the dynamic data generated by Ajax. As Ajax use grows, the volume of dynamically generated page data grows with it, and this data is valuable for data analysis, data mining, and similar tasks. For example, some of the comments on Sina News are generated dynamically through Ajax, and collecting this data is important for national security. This thesis improves Nutch by adding several modules to build a web crawler system that can crawl Ajax dynamic data, builds an index from the collected data, and provides search services to users.
[Degree-granting institution]: Inner Mongolia Normal University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
