Research on Ajax-Oriented Search Engine Technology

Published: 2018-11-07 06:40

[Abstract]: The Web is undergoing a major transformation, and the Web 2.0 era has arrived. Within Web 2.0, one technology that has been highly successful and now holds an important position is Ajax, which combines JavaScript with dynamic DOM manipulation and uses asynchronous communication with the server to deliver rich interactivity and responsiveness. However, Ajax has also shattered the traditional concept of the "web page", which is the foundation on which many existing web technologies are built. Along with innovation it therefore brings serious challenges, chiefly to the searchability and testability of web pages. This thesis focuses on searchability. It analyzes the technical bottlenecks that traditional web search engines have faced since the emergence of Ajax and surveys the current state of research on search engine technology that supports Ajax applications, with particular attention to Ajax crawlers. Although this research has produced some results, many problems remain unsolved. Because a single Ajax page contains multiple states, the thesis adopts the classic state transition graph model to model Ajax applications and describes a single-threaded Ajax crawling algorithm based on the state transition graph. On that basis it proposes a parallel crawling algorithm, and experiments show that crawling performance improves substantially. Building on the parallel crawler, the thesis further proposes a prototype Ajax search engine implemented on top of Nutch, a lightweight search engine. The prototype uses Nutch's plug-in mechanism to extend it with crawling, indexing, and retrieval of Ajax pages, which supports the correctness and effectiveness of the approach presented in the thesis.
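The state transition graph idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm, which the abstract does not spell out. It is a generic breadth-first exploration under stated assumptions: each state is identified by a hashable label, and a hypothetical callback `fire_events` returns the (event, next state) pairs reachable from a state. In a real crawler that callback would drive a browser and compare DOM snapshots; here a toy dictionary `TOY_APP` stands in for the application.

```python
from collections import deque


def crawl_states(initial_state, fire_events):
    """Explore an Ajax page's states breadth-first, single-threaded.

    fire_events(state) is a hypothetical callback that returns an iterable
    of (event, next_state) pairs. Returns the state transition graph as a
    dict mapping each state to its list of outgoing (event, next_state) edges.
    """
    graph = {initial_state: []}
    queue = deque([initial_state])
    while queue:
        state = queue.popleft()
        for event, next_state in fire_events(state):
            # Record the transition as an edge of the graph.
            graph[state].append((event, next_state))
            # Only enqueue states that have not been seen before.
            if next_state not in graph:
                graph[next_state] = []
                queue.append(next_state)
    return graph


# Toy stand-in for an Ajax application (assumed data, for illustration only).
TOY_APP = {
    "home": [("click #news", "news"), ("click #about", "about")],
    "news": [("click #more", "news-page-2"), ("click #home", "home")],
    "about": [("click #home", "home")],
    "news-page-2": [("click #home", "home")],
}


def toy_fire_events(state):
    # Look up the transitions that the toy application defines for a state.
    return TOY_APP.get(state, [])


graph = crawl_states("home", toy_fire_events)
for state, edges in graph.items():
    print(state, "->", edges)
```

Because every state is visited once and each of its events is fired once, the work splits naturally by state, which is one plausible reason a parallel variant can speed up crawling.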
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3

[References]