Design and Implementation of a Load Balancing System with Web Data Integration

Published: 2018-08-24 11:56

[Abstract]: With the rapid development of Internet technology, the volume of network data has grown explosively. In social networking, Twitter generates one billion tweets every two and a half days and processes nearly 2 TB of data per day; in e-commerce, Alipay has recorded a single-day peak of 188 million successful payments; and Google, the leading search engine, processes 20 PB of data per day. Gilder's Law likewise predicts that network traffic will keep growing. Beyond sheer volume, the form of network traffic is also becoming more complex. With the development and spread of CDN and multi-source download technologies, traffic no longer takes the form of simple single connections, which poses a great challenge to network flow filtering. At present, traffic generated by the CDN vendor Akamai accounts for 40% of global traffic, and YouTube, which makes heavy use of multi-source downloading, accounts for 30% of North American traffic.
DPI systems that filter massive data streams currently tend to use distributed, multi-level parallel processing to analyze and inspect network flows, and such a distributed cluster needs an efficient load balancing system to distribute traffic. This thesis focuses on the load balancing system at the front end of a DPI system, covering web data integration, data splitting, and traffic scheduling, and carries out key-technology research and system implementation in these areas.
Because CDN and multi-source download technologies are so widely used, a session in network traffic may consist of multiple connections, while a DPI system usually needs the whole session to be directed to a single back-end machine before it can be analyzed completely. Existing load balancing techniques cannot keep such sessions intact. To solve this problem, the web data integration function presented here clusters the server IPs under the same domain name into an IP cluster and uses the IP cluster as the unit of traffic splitting, so that traffic from the same content provider converges on the same back-end machine for analysis. This solves the session integrity problem. In addition, content from one provider is accessed by many users; if every user's traffic for the same content were analyzed separately by the back-end machines, the result would be a large amount of repeated computation and wasted computing resources. With web data integration, the load balancing system can deduplicate redundant data, reducing the repeated computation in the DPI system, saving computing resources, and raising the DPI system's throughput.
Data splitting and traffic scheduling build on web data integration: with the IP cluster as the scheduling unit, load is balanced according to feedback from the DPI system. As back-end load changes, IP clusters are split and merged in real time to even out the load across the back-end machines.
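The IP-cluster idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the class and method names (`ClusterBalancer`, `learn`, `dispatch`, `split`) are hypothetical, byte counts stand in for the DPI feedback, and the even-split rule is an assumption made only for illustration.

```python
from collections import defaultdict


class ClusterBalancer:
    """Toy dispatcher that uses a per-domain IP cluster as the scheduling unit."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.ip_to_cluster = {}               # server IP -> cluster id
        self.cluster_ips = defaultdict(set)   # cluster id -> server IPs
        self.cluster_backend = {}             # cluster id -> back-end machine
        self.cluster_load = defaultdict(int)  # cluster id -> bytes seen so far

    def learn(self, domain, server_ip):
        """Record that server_ip serves domain; the domain name is the cluster id."""
        self.ip_to_cluster[server_ip] = domain
        self.cluster_ips[domain].add(server_ip)

    def backend_load(self):
        """Sum the load of all clusters currently assigned to each back end."""
        load = {b: 0 for b in self.backends}
        for cid, backend in self.cluster_backend.items():
            load[backend] += self.cluster_load[cid]
        return load

    def dispatch(self, server_ip, nbytes):
        """Return the back end for a flow, keeping a whole cluster on one machine."""
        # An IP with no known domain is treated as a cluster of its own.
        cid = self.ip_to_cluster.get(server_ip, server_ip)
        if cid not in self.cluster_backend:
            load = self.backend_load()
            self.cluster_backend[cid] = min(self.backends, key=lambda b: load[b])
        self.cluster_load[cid] += nbytes
        return self.cluster_backend[cid]

    def split(self, cid):
        """Split a cluster in two and place the new half on the least-loaded back end."""
        ips = sorted(self.cluster_ips[cid])
        if len(ips) < 2:
            return None  # a single-IP cluster cannot be split further
        moved = ips[len(ips) // 2:]
        new_cid = f"{cid}#split"
        for ip in moved:
            self.cluster_ips[cid].discard(ip)
            self.cluster_ips[new_cid].add(ip)
            self.ip_to_cluster[ip] = new_cid
        # Assumption: load is spread evenly over the IPs of a cluster.
        share = self.cluster_load[cid] * len(moved) // len(ips)
        self.cluster_load[cid] -= share
        self.cluster_load[new_cid] = share
        load = self.backend_load()
        self.cluster_backend[new_cid] = min(self.backends, key=lambda b: load[b])
        return new_cid
```

In this sketch, two server IPs learned for the same domain are always sent to the same back end until `split` is called, after which the moved half follows its new cluster. A real system would also need the merge step, the deduplication of repeated content, and actual load feedback from the DPI back ends, none of which are modeled here.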
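[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.06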

[References]