1. Background
At 10:04 on July 2, 2024, the physical public-network optical fiber serving our Computer Room A was cut, making Computer Room A unreachable from the public network. This article analyzes the problems we identified during this outage, and our governance and optimization measures, from the perspective of DCDN architecture and multi-active governance.
2. Damage Mitigation Process
After the outage occurred, SREs and network engineers received a large number of alerts for dedicated-line outages and public-network probe failures, and quickly convened an online meeting to collaborate on fault location and damage mitigation.
During this period, core services (such as homepage recommendation and playback) were unaffected, because automatic disaster recovery at the origin-computer-room level had been configured on the DCDN side and took effect.
We first determined that a single carrier's line was suffering heavy abnormal packet loss, so we prioritized switching that carrier's user traffic to CDN dedicated-line nodes that pull origin over the dedicated line. That portion of user traffic recovered, but the overall service did not fully recover.
Continuing to troubleshoot, we found that the entire public network of Computer Room A was unreachable. For core service scenarios, automatic disaster recovery to Computer Room B had already taken effect; we observed its traffic increasing while service SLOs remained normal, so we decided to switch all multi-active services to Computer Room B for damage mitigation. At this point, damage mitigation for multi-active services was complete, while non-multi-active services remained impaired.
Finally, we degraded non-multi-active service traffic, switching it to pull origin via CDN dedicated-line nodes, which completed damage mitigation for non-multi-active traffic.
3. Problem Analysis
Figure 1: North-south traffic architecture diagram / 0702 outage logic diagram
Figure 2: B2-CDN ring network schematic diagram

Let us first briefly introduce Bilibili's origin architecture. As can be seen from Figure 1 above, Bilibili's online services have two core computer rooms, each with two Internet access points (public network POPs), and these two access points are located in different provinces and cities. The core idea of this design is to decouple network access (collectively referred to as POPs below) from the computing power centers (collectively referred to as computer rooms below), so that access-layer failures can be survived through disaster recovery.
Meanwhile, as can be seen from Figure 2, to improve the stability and efficiency of origin pull from self-built CDN nodes to the core origin computer rooms, we designed and built the B2-CDN ring network, which lets edge L1 & L2 self-built CDN nodes pull origin through the ring, enriching the paths by which services fetch data from edge nodes back to the core origin. In fact, the original design intent of the B2-CDN ring network was to give L1 & L2 self-built CDN nodes more options for handling edge cold, warm, and hot traffic, and to explore an edge network scheduling method better suited to Bilibili's service characteristics. At the bottom layer, the B2-CDN ring network uses Layer 2 MPLS-VPN technology to form a full mesh between all nodes, and on top of that uses Layer 3 routing protocols (OSPF, BGP) to interconnect each node with the core origin computer rooms. At the same time, all services retain the ability to pull origin from the core computer rooms via the public network, which serves as a fallback origin-pull scheme for extreme outage scenarios of the B2-CDN ring network.
Bilibili's interface requests are mainly accelerated and returned to origin via DCDN. DCDN nodes come in two types: public-network nodes that pull origin over the public network, and dedicated-line nodes that pull origin over dedicated lines. Under normal circumstances, DCDN public-network nodes can reach the origin through either of the dual public-network POPs, while dedicated-line nodes reach the origin through internal dedicated lines. DCDN also has a Health Check function for origin servers that automatically removes origin IPs that fail probes: for example, when a DCDN node's origin request to POP A encounters an exception, it retries against POP B. Under normal operation, this dual-POP cross-pull handles packet loss or an outage at a single origin POP, and the disaster recovery scheme takes effect automatically with almost no impact on services.
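The health-check and cross-POP retry behavior described above can be sketched as follows. This is a minimal illustration, not the actual DCDN implementation; the probe function and POP names are hypothetical:

```python
from typing import Callable, List

def pick_origin(pops: List[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first origin POP that passes its health probe.

    Mirrors the Health Check behavior described above: origin IPs (POPs)
    that fail probes are removed from consideration, so a request that
    would have gone to POP A automatically retries against POP B.
    """
    for pop in pops:
        if is_healthy(pop):
            return pop
    raise RuntimeError("all origin POPs failed health checks")

# Hypothetical scenario: POP A is down, so traffic falls through to POP B.
status = {"pop-a": False, "pop-b": True}
chosen = pick_origin(["pop-a", "pop-b"], lambda p: status[p])
# chosen == "pop-b"
```

Note that this covers a single-POP failure; as the outage showed, it cannot help when both POPs of a computer room fail at once.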
However, in this outage both POPs connected to Computer Room A failed, which is equivalent to Computer Room A's public network going offline entirely. Unlike a single-POP outage, the conventional mutual disaster recovery between the dual POPs could not take effect. Apart from the few core service scenarios that were unaffected because computer-room-level disaster recovery strategies had been pre-configured, multi-active services without automatic disaster recovery required computer-room-level traffic switching for damage mitigation. DCDN dedicated-line nodes, which pull origin via the B2-CDN ring network and were unaffected by this outage, ultimately became the escape path for non-multi-active services.
Looking back at the entire damage mitigation process, we found the following problems:
- Fault delimitation during an extreme computer-room network outage was slow, and contingency plans were incomplete;
- Some multi-active services still required manual traffic switching for damage mitigation. Can this process be made faster, or even fully automatic?
- How can non-multi-active services actively escape when the entrances and exits of a computer room fail?
4. Optimization Measures
In response to the problems encountered in this outage, we re-evaluated the contingency plans and improvement measures for single computer room outages. We found that the overall damage mitigation plan for multi-active services is consistent, with a focus on the effectiveness of automatic disaster recovery and the efficiency of manual traffic switching; while non-multi-active services need to have multiple escape methods: origin pull via DCDN internal network nodes, or cross-computer-room forwarding via API gateway.
Contingency Plan for Extreme Computer Room Network Outages
As mentioned above, the origin has three entrances: dual public network POPs plus a dedicated line, so logically, if any two entrances are abnormal, there is still an opportunity to ensure service availability. Therefore, we have implemented the following measures:
- Expand the computing power and scale of DCDN dedicated-line nodes to maximize carrying capacity in extreme scenarios;
- Formulate a scheduling contingency plan for the failure of both public-network POP egresses. We group domain names and DCDN node types, and support quickly switching non-multi-active domain names to dedicated-line nodes; since multi-active domain names can achieve damage mitigation via traffic switching, they are not scheduled to dedicated-line nodes, avoiding additional load on those nodes;
- Improve the efficiency of fault delimitation: optimize the reporting link for important monitoring, decouple it from service links, and deploy it on public cloud for disaster recovery; optimize the network topology panel to clearly display the status of each link; and optimize alerting and display methods to speed up problem location.
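The domain-grouping rule in the scheduling contingency plan can be sketched as a small classification step. This is an illustrative sketch, not the real scheduling system; the domain names and metadata shape are hypothetical:

```python
from typing import Dict, List

def plan_pop_outage_switch(domains: Dict[str, bool]) -> Dict[str, List[str]]:
    """Group domains for the dual-POP-outage contingency plan.

    `domains` maps a domain name to True if it is multi-active.
    Non-multi-active domains are switched to dedicated-line DCDN nodes;
    multi-active domains are left to computer-room-level traffic
    switching so they do not add extra load to the dedicated line.
    """
    return {
        "switch_to_dedicated_line": sorted(d for d, ma in domains.items() if not ma),
        "handle_via_traffic_switch": sorted(d for d, ma in domains.items() if ma),
    }

# Hypothetical domain inventory.
plan = plan_pop_outage_switch({
    "api.example.com": True,      # multi-active: room-level switch
    "legacy.example.com": False,  # non-multi-active: dedicated-line escape
})
```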
Continuous Promotion of Multi-active Construction and Regular Drills
Figure 4: Schematic diagram of the same-city multi-active architecture

Currently, our services mainly adopt a same-city multi-active architecture. As shown in Figure 4, we logically divide multiple computer rooms into two availability zones, each of which bears 50% of the traffic under normal operation. Dividing the overall multi-active architecture into layers:
- Access layer:
  - DCDN: north-south traffic management; routes to origin computer rooms in different availability zones based on a hash of user-dimension information; supports automatic disaster recovery at the availability-zone level;
  - Layer 7 load balancing / API gateway: north-south traffic management; supports interface-level routing, timeout control, same-/cross-availability-zone retries, circuit breaking, rate limiting, and client flow control;
  - Service discovery / service governance components: east-west fine-grained traffic management; the framework SDK supports preferring calls within the same availability zone, plus service- and interface-level traffic scheduling;
- Cache layer: mainly Redis Cluster and Memcache, accessed through Proxy components. It does not support cross-availability-zone synchronization, so it must be deployed independently in both availability zones; eventual data consistency is maintained by subscribing to database Binlog, and pure-cache scenarios need additional transformation;
- Message layer: in principle, production/consumption is closed within an availability zone; supports Topic-level two-way message synchronization across availability zones and three consumption modes (Local/Global/None) to adapt to different service scenarios;
- Data layer: mainly MySQL and KV storage, using master-slave synchronization; Proxy components are provided for service access, supporting multi-availability-zone reads, nearby reads, routing write traffic to the primary node, forced reads from the primary, etc.;
- Control layer: the Invoker multi-active control platform supports multi-active metadata management, north-south/east-west traffic switching, DNS switching, contingency plan management, and multi-active risk inspection.
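The access-layer routing idea, hashing user-dimension information to pick an availability zone, with zone-level automatic disaster recovery, can be sketched like this. A minimal sketch under assumed names (`az-1`/`az-2`, CRC32 as the hash); the real DCDN routing is more involved:

```python
import zlib
from typing import Set

ZONES = ["az-1", "az-2"]

def route_zone(user_id: str, healthy_zones: Set[str]) -> str:
    """Route a user to an availability zone by hashing user identity.

    Under normal operation the hash splits users roughly 50/50 between
    the two zones; if the preferred zone is unhealthy, traffic falls
    back to the surviving zone (zone-level automatic disaster recovery).
    """
    preferred = ZONES[zlib.crc32(user_id.encode()) % len(ZONES)]
    if preferred in healthy_zones:
        return preferred
    fallback = next((z for z in ZONES if z in healthy_zones), None)
    if fallback is None:
        raise RuntimeError("no healthy availability zone")
    return fallback
```

Hashing on a stable user dimension keeps a given user pinned to one zone, which matters for the cache and data layers described above.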
For services that have completed multi-active transformation, we have built a multi-active control platform to uniformly maintain service multi-active metadata, and support north-south and east-west multi-active traffic switching control. The platform side supports maintenance of traffic switching contingency plans, and enables fast traffic switching for single services, multiple services, and full-site operations. At the same time, the platform provides multi-active related risk inspection capabilities, regularly inspects risks from the perspectives of multi-active traffic ratio, service capacity, component configuration, cross-computer-room calls, etc., and supports governance and operation of related risks.
After completing pre-contingency plan maintenance and risk governance, we regularly conduct north-south traffic switching drills for single services and combinations of multiple services, to verify resource load conditions such as capacity and rate limiting of the service itself, its dependent components, and its dependent downstream services, to regularly ensure the effectiveness of multi-active deployment, and maintain the capability for switching and disaster recovery at any time.
Computer Room-level Automatic Disaster Recovery
For core services involved in highly user-perceivable scenarios, we configure computer room-level disaster recovery strategies for origin servers on the DCDN side. When a failure occurs at the entrance of a single origin computer room, traffic can be automatically routed to another computer room to achieve damage mitigation.
Previously, automatic disaster recovery was not enabled by default for all multi-active services; we prioritized core scenarios such as homepage recommendation and playback, while other service scenarios performed traffic switching based on resource-pool utilization. Currently, the average CPU utilization of our resource pools exceeds 35%, and the average peak CPU utilization of online services approaches 50%. We have worked out the resources required to switch all site services to a single computer room. Multi-active traffic switching will also coordinate with the platform to adjust HPA policies, and a rapid-elasticity contingency plan for resource pools is prepared to keep overall resources healthy. We will subsequently support automatic disaster recovery policies for more highly user-perceivable scenarios such as community interaction, search, and user space, so that when a computer-room-level outage occurs, multi-active services complete disaster recovery and damage mitigation without manual intervention.
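The capacity reasoning behind the HPA and elasticity preparations can be made concrete with a first-order estimate: if both rooms peak near 50% CPU, the surviving room must absorb roughly the sum of both loads after a full switch. A simplified model assuming the two rooms have equal capacity:

```python
def post_switch_utilization(util_a: float, util_b: float) -> float:
    """Estimate the surviving room's CPU utilization after a full switch.

    First-order model assuming both rooms have equal capacity: the
    surviving room serves both rooms' load, so utilizations add.
    """
    return util_a + util_b

def extra_capacity_needed(projected: float, target: float) -> float:
    """Fraction of additional capacity needed to bring projected
    utilization back down to the target level (0 if already below)."""
    return max(0.0, projected / target - 1.0)

# Both rooms peaking near 50%: a full switch drives the surviving room
# toward ~100%, which is why switching is paired with HPA adjustments
# and rapid resource-pool elasticity.
projected = post_switch_utilization(0.5, 0.5)
# projected == 1.0
headroom = extra_capacity_needed(projected, target=0.8)
```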
Figure 5: North-south traffic architecture for multi-active services: normal state / disaster recovery state

Non-multi-active Traffic Escape
Some services are not yet deployed in a multi-computer-room multi-active architecture and can handle traffic in only one computer room. In the original scheme, this non-multi-active traffic could only pull origin from Computer Room A and could not survive a failure of Computer Room A's public-network entrance. As in this outage, non-multi-active traffic could not be switched for damage mitigation and had to rely on degradation through CDN dedicated-line nodes.
To handle scenarios such as a single computer room's public-network entrance failing, or Layer 4 / Layer 7 load balancing failures, we plan to also configure origin-level automatic disaster recovery rules for non-multi-active services on the DCDN side, and to merge and unify the routing configurations of multiple computer rooms and clusters on the Layer 7 load balancer (SLB), ensuring that non-multi-active traffic can be routed to the API gateway via Computer Room B during an outage. The API gateway then determines whether an interface is multi-active; non-multi-active interfaces are forwarded over internal dedicated lines to achieve traffic escape.
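The gateway's escape decision reduces to a simple rule. A hypothetical sketch of that rule (the interface paths and room names are illustrative, not the actual gateway configuration):

```python
from typing import Set

def gateway_route(path: str, multi_active: Set[str],
                  local_room: str, home_room: str) -> str:
    """Decide where the API gateway serves a request during escape.

    Multi-active interfaces are served in whichever room received the
    traffic; non-multi-active interfaces are forwarded over the
    internal dedicated line back to their home room.
    """
    if path in multi_active:
        return local_room
    return home_room  # forwarded via internal dedicated line

# Hypothetical interfaces: /feed is multi-active, /legacy/pay is not.
# During a Computer Room A outage, traffic lands on the gateway in room B.
feed_target = gateway_route("/feed", {"/feed"}, "room-b", "room-a")
pay_target = gateway_route("/legacy/pay", {"/feed"}, "room-b", "room-a")
```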
Figure 6: North-south traffic architecture for non-multi-active services: normal state / disaster recovery state

5. Summary
A single computer room-level outage is a great test of the completeness and effectiveness of multi-active transformation, and it must be verified through outage drills. In the second half of this year, we will continue to focus on multi-active risk governance. In addition to regular traffic switching drills, we will also launch network outage drills for north-south and east-west traffic. We will also share special content on multi-active governance and drills in the future, so stay tuned!
-End-
Author丨SRE Team, Network Team
This is a discussion topic separated from the original topic at https://www.bilibili.com/read/cv36728096/







