Intel HDSLB High-Performance Layer 4 Load Balancer — Quick Start and Application Scenarios

Preface and Background

Today, with the rapid development and widespread adoption of cloud computing, SDN, and NFV, and with ever more workloads running on the cloud and data centers growing in scale, the economies of scale of cloud computing have become increasingly important. The resource-intensive architecture of cloud computing therefore makes network performance an evergreen topic. Across the cloud networking stack, the pursuit of performance is not only pervasive but also extremely demanding: every bit of performance improvement brings cost reduction, increased revenue, and greater product competitiveness.

Broadly speaking, the performance pursuits of cloud networking fall into several areas: the bandwidth performance of physical networks, the tunnel forwarding performance of virtual networks, the load balancing performance of layer 4 networks, and the I/O processing performance of application-layer networks. Today, as the bandwidth demands of data centers and edge devices keep growing, the performance of load balancers, which sit at the entry point of user-facing service networks, is critically important. This is exactly the topic of this series of articles: Intel HDSLB, a high-performance layer 4 load balancer built on software-hardware fusion acceleration technology.

In this series of articles, to clearly introduce HDSLB, the author plans to expand content step by step across three levels: perceptual understanding, rational understanding, and in-depth analysis, and will share the following articles one by one. Stay tuned. :)

  1. Intel HDSLB High-Performance Layer 4 Load Balancer — Quick Start and Application Scenarios
  2. Intel HDSLB High-Performance Layer 4 Load Balancer — Basic Principles and Deployment Configuration
  3. Intel HDSLB High-Performance Layer 4 Load Balancer — Advanced Features and Code Analysis

Limitations of Traditional LB Technologies

Before diving into HDSLB, it is worth reviewing the basic concepts, types, functions, and principles of traditional LB (load balancing). In a modern IT system, the role of an LB is to build a highly available, highly concurrent, and highly scalable backend server cluster; in essence, it is a traffic-distribution network element.

Throughout the long history of technological evolution, LB technology has always focused on development in the following areas:

  1. LB Algorithms: How can traffic be distributed "intelligently on-demand" to backend server clusters according to different application scenarios?
  2. LB High Availability: As an intermediate node in the traffic path, how can the LB network element guarantee its own high availability? Active-standby mechanism or multi-active mechanism?
  3. LB Reverse Proxy: How to provide reverse proxy and protocol processing capabilities for multiple types of L4-L7 network protocols such as TCP, UDP, SSL, HTTP, FTP, ALG?
  4. LB High Performance: How to support larger bandwidth, lower latency, higher CPS, and larger-scale backend server clusters?
  5. LB Clustering: How to provide better horizontal scaling capabilities for itself?
  6. And more.

NOTE: CPS (Connections-per-second) is a key performance indicator for load balancers, describing the ability of a load balancer to stably process TCP connection establishments per second.
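As a toy illustration of point 1 above (LB algorithms), the following Python sketch implements a simple weighted round-robin scheduler. It is purely illustrative and is not HDSLB's actual code; the server names and weights are made up:

```python
from itertools import cycle

def weighted_round_robin(servers):
    """Expand each (server, weight) pair into a repeating schedule.

    servers: list of (name, weight) tuples; weight is a positive int.
    Returns an infinite iterator over server names.
    """
    expanded = [name for name, weight in servers for _ in range(weight)]
    return cycle(expanded)

# Backend pool: 'rs1' should receive twice the share of 'rs2'.
pool = [("rs1", 2), ("rs2", 1)]
scheduler = weighted_round_robin(pool)
assignments = [next(scheduler) for _ in range(6)]
# → ['rs1', 'rs1', 'rs2', 'rs1', 'rs1', 'rs2']
```

Production schedulers (e.g. smooth weighted round-robin, or WLC which also tracks live connection counts) are more sophisticated, but the core idea of distributing requests by weight is the same.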

In the past, we have commonly used the following LB solutions:

Admittedly, these LB solutions are still widely used in LB scenarios at the user business layer today. However, correspondingly, they are facing problems such as performance bottlenecks, poor scalability, and low cloud adaptability in LB scenarios at the cloud infrastructure layer.

With the vigorous development of advanced technologies such as heterogeneous computing and software-hardware fusion acceleration, more and more new networking projects are being built around high-performance data plane technologies such as DPDK, DPVS, VPP, and SmartNIC/DPU, in order to create a new generation of load balancing products better suited to large-scale platforms such as cloud computing. The Intel HDSLB discussed in this series is one of them.

Features and Advantages of HDSLB

The HDSLB (High Density Scalable Load Balancer) project was originally initiated by Intel, with the goal of building a layer 4 (TCP/UDP) load balancer with industry-leading performance. In its name:

  • High Density: Refers to the extremely high number of concurrent TCP connections and the high throughput of a single HDSLB node.
  • Scalable: Means that its performance scales linearly as the number of CPU cores or the total amount of resources increases.

It is worth noting that in a complete LB system, HDSLB is positioned as a layer 4 load balancer, while layer 7 load balancers (e.g. Nginx) act as special RSs (real servers) of HDSLB: they are attached behind HDSLB to provide higher-layer load balancing capabilities.

Currently, Intel has released version v23.04 of HDSLB. Developers can use the open source HDSLB-DPVS version hosted on GitHub, while the commercial HDSLB-VPP version, with more advanced features, is available to commercial partners.

As a typical representative of the new generation of load balancers, HDSLB has the following functional features:

  1. Higher Performance: It achieves single-node throughput of 150Mpps, 100 million-level concurrent TCP connections, 10 million-level new TCP connections per second, and 10Mpps-level elephant flow capacity, with industry-leading performance.
  2. Excellent Hardware Acceleration Capability: Based on Intel's hardware ecosystem, it makes full use of the instruction sets of Intel Xeon series CPUs, such as AVX2 and AVX-512, as well as platform accelerators such as DLB (Dynamic Load Balancer) and DSA (Data Streaming Accelerator). It also leverages the hardware features of Intel E810 100GbE network cards, including smart NIC technologies such as SR-IOV, FDIR (Flow Director), RSS, DDP (Dynamic Device Personalization), and ADQ (Application Device Queues). Building on Intel's fully optimized technology ecosystem, users can deeply tune the software-hardware fusion acceleration solution according to the needs of their own business scenarios.
  3. Excellent Multi-core Scalability: Single-machine throughput can grow linearly with the number of CPU cores to a large extent.
  4. Flexible Horizontal Scaling Capability: Natively supports NFV and flexible horizontal scaling.
  5. Supports Multiple LB Algorithms: Including RR, WLC, Consistent Hash, etc.
  6. Supports Multiple LB Modes: Including FULL-NAT, SNAT, DNAT, DR, IP Tunneling (IPIP), etc.
  7. Supports HA Clusters: Implements active-standby high availability based on Keepalived, and supports session sync capability.
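To make the FULL-NAT mode from the feature list above concrete, here is a minimal Python sketch of the address rewriting it performs. All addresses are hypothetical, and a real data-plane implementation looks the reply's client address up in a session table rather than using fixed constants:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str

# Hypothetical addresses, for illustration only.
VIP = "10.0.0.10"          # virtual service address exposed to clients
LOCAL_IP = "192.168.1.2"   # LB-local address used toward the real server
RS_IP = "192.168.1.100"    # real server
CLIENT_IP = "203.0.113.5"  # client

def fullnat_inbound(pkt: Packet) -> Packet:
    """FULL-NAT rewrites BOTH addresses: client->VIP becomes localIP->RS."""
    return replace(pkt, src_ip=LOCAL_IP, dst_ip=RS_IP)

def fullnat_outbound(pkt: Packet) -> Packet:
    """Reverse rewrite for the RS reply: RS->localIP becomes VIP->client."""
    return replace(pkt, src_ip=VIP, dst_ip=CLIENT_IP)

inbound = fullnat_inbound(Packet(src_ip=CLIENT_IP, dst_ip=VIP))
reply = fullnat_outbound(Packet(src_ip=RS_IP, dst_ip=LOCAL_IP))
```

Because both source and destination are rewritten, the RS needs no special routing toward the client, which is what makes FULL-NAT the easiest mode to deploy at scale (at the cost of hiding the client IP unless it is carried separately, e.g. via TOA).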

NOTE: In the following content, we will mainly discuss the HDSLB-VPP version.

Performance Parameters of HDSLB

Baseline Performance Data

For the most critical performance factors, we use baseline performance data officially recognized by Intel, taken from Volcano Engine's HDSLB test case.

  1. Test environment parameters:

  2. Test topology:

  3. In the 1~16 Core scenario, the 64-byte forwarding throughput (unit: Mpps) test results are shown in the figure below; higher results are better.

  4. In the 1~4 Core scenario, the TCP CPS (unit: K) test results are shown in the figure below; higher results are better.

From the above results, we can see that the single-core throughput of HDSLB-VPP reaches 8 Mpps and scales nearly linearly with the number of cores. Likewise, the single-core TCP CPS of HDSLB-VPP reaches 880K, also with near-linear multi-core scalability.
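Taking these single-core figures at face value, and assuming ideal linear scaling (real-world scaling is close to, but below, 100% efficient), the multi-core projections reduce to simple arithmetic:

```python
SINGLE_CORE_MPPS = 8     # 64-byte forwarding throughput per core, from the test above
SINGLE_CORE_CPS_K = 880  # new TCP connections per second per core, in thousands

def projected(cores, per_core, efficiency=1.0):
    """Idealized linear projection; pass efficiency < 1.0 to model scaling loss."""
    return cores * per_core * efficiency

mpps_16c = projected(16, SINGLE_CORE_MPPS)   # 16 cores -> 128 Mpps
cps_4c_k = projected(4, SINGLE_CORE_CPS_K)   # 4 cores  -> 3520 K CPS
```

These idealized numbers are upper bounds; the measured multi-core results in the figures above are what actually demonstrate how close HDSLB-VPP gets to linear scaling.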

Comparison with Competitors

We also obtained official test data from a head-to-head comparison between HDSLB and the latest published performance figures of an open source L4 LB solution.

Test environment parameters:

  • Open source L4 LB test environment parameters: CPU E5-2650, 2K~10K concurrent TCP sessions per core, 64-byte UDP traffic.
  • HDSLB-VPP test environment parameters: 3rd Generation Intel Xeon-SP CPU, 10M concurrent TCP sessions per core, 64-byte UDP traffic.

As can be seen from the first figure below, in the FNAT IPv4 throughput test case, even with a 10x increase in concurrent TCP sessions per core, HDSLB-VPP still achieves more than 3x the single-core throughput performance advantage, and has better multi-core linear scalability.

At the same time, in the FNAT throughput scenario, the consistent results across the three packet loss modes, MAX (best-effort forwarding), PDR (0.01% packet loss rate), and NDR (zero packet loss), also reflect the excellent forwarding stability of HDSLB-VPP. Furthermore, the results under LB modes such as NAT, DR, and IPIP follow the same trend as FNAT mode.

The second figure shows that the CPS (new TCP connections per second) performance of HDSLB-VPP is 5 times higher than the comparison solution.

In addition, HDSLB-VPP deeply optimizes the memory layout of its data structures on top of the VPP framework, allowing the maximum number of concurrent TCP sessions to break through the preset 100M (100 million) level at the same memory consumption: it can be expanded to the 500M (500 million) level in FNAT mode, and even reach the 1000M (1 billion) level in NAT mode.

The memory optimization and advantage in concurrent TCP session capacity of HDSLB-VPP make it more practical in IPv6 scenarios. While having the advantage of high performance, it can also save more system resources for deploying other businesses.
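As a back-of-envelope illustration of why per-entry memory matters for session capacity, consider the calculation below. The 128-byte entry size and 64 GiB table budget are hypothetical assumptions for the arithmetic, not HDSLB's real layout:

```python
def sessions_per_budget(bytes_per_session, mem_gib):
    """How many concurrent session entries fit in a given session-table budget."""
    return (mem_gib * 2**30) // bytes_per_session

# Hypothetical 128-byte session entry in a 64 GiB table:
capacity = sessions_per_budget(128, 64)
# → 536,870,912 entries, i.e. roughly the 500M level
# Halving the entry size (or the per-entry overhead) doubles the capacity:
capacity_smaller_entry = sessions_per_budget(64, 64)
```

This is why shrinking per-session data structures translates directly into hundreds of millions of additional concurrent sessions under the same memory footprint, and why the gain is especially relevant for IPv6, where addresses quadruple in size.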

Application Scenarios of HDSLB

Based on the above features, HDSLB is currently mainly used as an L4 LB network element in cloud computing and edge computing.

For resource-intensive cloud computing scenarios, we need to address the following two key characteristics:

  1. Extremely large base traffic: Cloud computing has many tenants, heavy traffic, and rapidly changing business volume. This requires the L4 LB to be highly scalable, able to respond quickly to growth and changes in user business volume, and, ideally, to reduce server procurement costs. HDSLB's horizontal (scale-out) and multi-core (scale-up) scalability meet this demand well.
  2. Frequent occurrence of elephant flows: Cloud computing is a self-service platform, and the basic network cannot predict or control when elephant flows or mouse flows will occur. Therefore, basic network elements must improve packet processing performance per CPU core as much as possible, to alleviate the problem of packet loss caused by elephant flows or bursty traffic in the "single-core single-traffic processing model" to a certain extent, as shown in the figure below.
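The "single-core single-flow processing model" can be sketched as follows: an RSS-style hash maps each flow's 5-tuple to a fixed core, so every packet of an elephant flow lands on the same core, which then becomes the bottleneck no matter how many other cores sit idle. Real NICs use a Toeplitz hash; CRC32 stands in here purely for illustration:

```python
import zlib

NUM_CORES = 4

def rss_core(five_tuple, num_cores=NUM_CORES):
    """Toy RSS: hash a flow 5-tuple to pick the receive queue / CPU core.

    Real NICs use a keyed Toeplitz hash; crc32 is a deterministic stand-in.
    """
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % num_cores

# One elephant flow: a single (src_ip, src_port, dst_ip, dst_port, proto) tuple.
elephant = ("203.0.113.5", 40000, "10.0.0.10", 80, "tcp")

# Every packet of the flow hashes to the same core, so that core caps the flow's rate.
cores = {rss_core(elephant) for _ in range(1000)}
```

Since the hash is a pure function of the 5-tuple, `len(cores)` is 1: all 1000 "packets" of the elephant flow are steered to one core. Distributing a single flow across cores requires help below the RSS layer, which is where hardware such as Intel DLB comes in.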

To address the elephant flow problem, HDSLB-VPP with Intel DLB hardware acceleration can achieve performance closer to line rate than pure software solutions in elephant flow scenarios with packet lengths of 96B, 128B, 256B, and 512B. It is fair to say that HDSLB's tuning for Intel hardware acceleration is among the best available today.

For resource-constrained edge computing scenarios targeting vertical industries, we need to address the following two key characteristics:

  1. High requirements for low-latency services: Edge computing mostly serves enterprise (B2B) users in vertical OT and CT industries, where business systems and proprietary network protocols have very strict requirements on network delay and jitter. HDSLB, combined with optimizations for Intel E810 or IPU series network cards, can guarantee low latency and jitter resistance for data transmission at the hardware level.
  2. Strong requirement for single-machine performance at the edge: The physical space of edge computer rooms is limited, and cannot accommodate a large number of servers, so higher single-machine performance is preferred. HDSLB can more comprehensively improve single-machine performance through comprehensive tuning of multiple hardware acceleration technologies such as CPU and SmartNIC/IPU.

For performance tuning combinations in different application scenarios such as cloud computing, edge computing, telecom cloud, and network security, Intel also provides the following official configuration reference.

Development Prospects of HDSLB

In the future, HDSLB's roadmap includes the following items:

  1. Support for 100 million-level maximum concurrent TCP connections;
  2. Single-CPU-core throughput of more than 8 Mpps, with performance that grows linearly with the number of CPU cores;
  3. A new TCP connection rate of more than 800,000 CPS per CPU core, with performance that grows linearly with the number of CPU cores;
  4. Support for elephant flow processing based on 4th Generation Xeon-SP accelerators;
  5. Support for QoS traffic rate limiting;
  6. Support for Anti-DDoS security capabilities;
  7. And more.

With the industry-wide technological trend of "NFV-ization of business gateways and hardware offload of edge gateways", HDSLB relies on Intel's heterogeneous computing hardware ecosystem on the one hand, and on the innovation of open source communities such as DPDK and VPP on the other. With this dual approach, HDSLB is expected to be adopted and promoted in more application scenarios.

Personally, I mainly focus on two aspects:

  1. Implementation of heterogeneous accelerator solutions, to better solve practical pain points in cloud computing and edge computing scenarios, such as intelligent perception and scheduling of elephant flows and mouse flows, industrial-grade ultra-low latency transmission, etc.
  2. Introduction of advanced hardware technologies and their application scenarios, such as new generation CPU instruction sets, memory pooling, next-generation network communication acceleration technologies, etc.

This is a standalone discussion topic separated from the original thread at https://juejin.cn/post/7368692841298575414