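Design and Implementation of an IPv6 Information Collection System
Published: 2018-04-17 04:11
Topics: IPv6 resources + information collection; Source: South China University of Technology, master's thesis, 2012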
[Abstract]: With the rapid growth of the Internet, network resources have become increasingly abundant, which poses great challenges for general-purpose information collection systems and search engines. Users' demands on information services are becoming higher and more specialized, and general-purpose search engines cannot meet their needs in specialized domains. Topic-focused information collection emerged in response. IPv4 addresses have been exhausted and the Internet is moving to IPv6. The number of IPv6 addresses in China has also grown rapidly over the past year, and demand for IPv6 resources keeps increasing. Under these circumstances, an IPv6 topic-focused information collection system is needed to crawl IPv6 resources more quickly.
This thesis aims to design and implement an efficient, robust, configurable, and accurate IPv6 topic-focused information collection system that supplies search engines with reliable IPv6 resources and so meets the demand for them. It first surveys the state of information collection systems in China and abroad. It then introduces the relevant theory of search engines, mainly the development of search engines, the basic principles of information collection, and the algorithms used for topic crawlers and web page analysis. The IPv6 information collection system is implemented with a distributed-system framework and an MVC layered design. A DNS cache, a robots cache, and a site information cache are added to improve performance. The thesis also proposes three collection strategies to guide the crawling of IPv6 resources: priority for education-network sites, priority for large sites, and score propagation based on the site link structure. Communication between distributed nodes uses RMI: the master node sends execution commands to the slave nodes, and each slave node reports its status to the master node by sending heartbeat messages.
The system is tested in three ways: a DNS cache effectiveness test, a collection performance test, and a test of the IPv6 collection strategies. After the IPv6 resources are collected, site statistics are compiled, and the topology and resource distribution of IPv6 sites are obtained and analyzed.
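The abstract does not say how the DNS cache is built. As a rough illustration only, the Java sketch below shows one way a crawler might cache IPv6 lookups with a fixed time-to-live. The class name `Ipv6DnsCache`, the `Resolver` interface, the fixed TTL, and the explicit `nowMillis` parameter are all assumptions made for this example and are not taken from the thesis.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical TTL-based cache of IPv6 lookups for a crawler (illustration only). */
class Ipv6DnsCache {

    /** Lookup abstraction (an assumption) so the real resolver can be replaced in tests. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /** One cached answer together with the time at which it stops being valid. */
    private static final class Entry {
        final List<Inet6Address> addresses;
        final long expiresAtMillis;

        Entry(List<Inet6Address> addresses, long expiresAtMillis) {
            this.addresses = addresses;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Resolver resolver;
    private final long ttlMillis;

    Ipv6DnsCache(Resolver resolver, long ttlMillis) {
        this.resolver = resolver;
        this.ttlMillis = ttlMillis;
    }

    /** Convenience factory that delegates to the JDK's own resolver. */
    static Ipv6DnsCache withSystemResolver(long ttlMillis) {
        return new Ipv6DnsCache(InetAddress::getAllByName, ttlMillis);
    }

    /** Looks up a host using the current wall-clock time. */
    List<Inet6Address> lookup(String host) throws UnknownHostException {
        return lookup(host, System.currentTimeMillis());
    }

    /**
     * Returns the IPv6 addresses of a host, resolving only when no fresh entry exists.
     * Failed lookups are not cached: the exception simply propagates to the caller.
     */
    List<Inet6Address> lookup(String host, long nowMillis) throws UnknownHostException {
        String key = host.toLowerCase(Locale.ROOT); // host names are case-insensitive
        Entry cached = cache.get(key);
        if (cached != null && nowMillis < cached.expiresAtMillis) {
            return cached.addresses;
        }
        List<Inet6Address> v6 = new ArrayList<>();
        for (InetAddress address : resolver.resolve(key)) {
            if (address instanceof Inet6Address) { // drop IPv4 answers
                v6.add((Inet6Address) address);
            }
        }
        List<Inet6Address> result = Collections.unmodifiableList(v6);
        cache.put(key, new Entry(result, nowMillis + ttlMillis));
        return result;
    }
}
```

A crawler thread could call `Ipv6DnsCache.withSystemResolver(600_000L).lookup(host)` before opening a connection, so that repeated requests to the same site skip the resolver until the entry expires. A production cache would more likely honor the TTL carried in each DNS record than use one fixed value.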
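[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092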
Article ID: 1761998
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1761998.html