ByteDance Cloud-Native Protection System Practice


Background

As Kubernetes is widely adopted and deployed in enterprises, a layered technical architecture of "Business - Middle Platform - Infrastructure" has gradually emerged. This layered architecture shields the complex concepts of the platform and infrastructure layers, allowing applications to focus on business layer development, but it also means that the stability of upper-layer applications strongly relies on support from the underlying infrastructure, posing significant challenges to the stability of infrastructure in large-scale clusters:

  • Because clusters are so large, even a small, seemingly insignificant problem can be amplified many times over and turn into a systemic risk;
  • The complexity and diversity of scenarios make it difficult to completely rule out unexpected operations and maintenance actions.

This requires us to implement more effective extreme risk protection for the resources and objects managed by Kubernetes, to mitigate as much as possible the irreversible impact on businesses caused by human errors, component version and configuration mistakes, or control plane code bugs.

Although Kubernetes natively provides a series of protection mechanisms, such as strict RBAC verification mechanisms, using PodDisruptionBudget (PDB) to validate the Eviction API, and a rich set of Admission Plugins, we still found many scenarios that are not covered in actual production practice.

Against this background, ByteDance internally extended and transformed the Kubernetes system, adding a series of defensive verification measures and operational constraints to provide stronger stability support for businesses running on Kubernetes and reduce extreme risks.

Hardening

Kubernetes is a rather complex distributed system, but its core architectural design idea is very simple. Kubernetes provides a unified API interface via APIServer to enable access and modification of cluster state; various automated components can communicate with the cluster in a standardized way to continuously obtain data, calculate the difference between the current cluster state and the expected cluster state locally, and derive a series of change operations; finally, kubelet executes these state changes on each node to push the cluster toward the expected state.
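
The control loop described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not Kubernetes code: the plan_changes function and the dictionary-based state (workload name mapped to replica count) are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified state: workload name -> replica count.
State = Dict[str, int]
Action = Tuple[str, str, int]  # (verb, workload name, number of replicas affected)


def plan_changes(current: State, expected: State) -> List[Action]:
    """Derive the operations that move `current` toward `expected`."""
    actions: List[Action] = []
    for name, want in sorted(expected.items()):
        have = current.get(name, 0)
        if want > have:
            actions.append(("scale_up", name, want - have))
        elif want < have:
            actions.append(("scale_down", name, have - want))
    # Anything still running that is no longer expected is removed entirely.
    for name, have in sorted(current.items()):
        if name not in expected:
            actions.append(("remove", name, have))
    return actions
```

Note that calling plan_changes with an empty expected state yields a "remove" action for every workload. The loop has no notion of intent, which is why an incorrect expected state is so dangerous.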

As this shows, the interactions between Kubernetes components can be roughly divided into the following three layers:

  • Interaction between KV storage systems (such as etcd, Kine, Kubebrain) and apiserver, providing key-value level read/write operations and event listening;
  • Interaction between apiserver and the various built-in or additional controllers and operators (and between apiserver and users) via API requests;
  • Interaction between apiserver and single-node components.

Based on the above layers, we can systematically sort out a series of common systemic risks and take corresponding measures to harden the system and reduce extreme risks.

Data Protection

The interaction risks between storage and apiserver mainly focus on data anomalies, such as data corruption and loss; the storage system is the core of Kubernetes and the cornerstone of the entire event-driven distributed system. Once data anomalies occur, a series of faults may be directly or indirectly derived. Specifically, common extreme risk issues include but are not limited to the following:

  • Operational errors in the storage cluster cause the storage to go offline, making the entire Kubernetes cluster unavailable;
  • Administrators delete data directly in etcd, bypassing apiserver verification, so that key objects such as Namespaces, Deployments, and Pods may be deleted unexpectedly, triggering cascading deletion of objects and causing large-scale business losses;
  • Administrators directly modify data in etcd due to human error, damaging the data format and causing apiserver to fail to decode the data.

To address these issues, we have taken a series of measures in the production environment. First, we standardize storage cluster operations and data operations as far as possible, enable mutual TLS authentication on the storage side, and keep clients other than Kubernetes from accessing the storage directly wherever possible, to reduce the risk of data corruption or loss. Second, we back up the storage regularly, so that if irreversible data loss does occur in an extreme case, the data can be restored quickly from backup and the impact contained. In addition, by hardening other components, we limit the direct impact on businesses of unexpected events that stem from data anomalies.

Control Plane Protection

The interaction risks between automated components and apiserver mainly focus on unexpected operations. Under normal circumstances, users or platforms submit the expected state to apiserver, and other internal components will immediately derive a series of actions based on the difference between the current state and the expected state, causing the cluster to change; once an incorrect expected state is submitted, the cluster will quickly and irreversibly change toward the target state.

The main protection idea for this type of problem is to add extra restrictions on operations against key objects, such as requiring an additional, deliberately redundant step that acts as a double check, reducing the probability of risks caused by human errors or control plane code bugs. Specifically, operation protection is implemented via the ValidatingAdmissionWebhook extension mechanism natively provided by Kubernetes. We use labels and annotations to mark key objects that require operation protection, use selector configuration to filter these key objects and their corresponding operations, and implement a series of constraints in the Webhook, including but not limited to the following strategies:

  • Prevent Cascading Deletion: Deleting a root object such as a Namespace or CRD triggers cascading deletion of the objects derived from it. We therefore intercept the deletion of these key object types in the Webhook to avoid catastrophic cascading deletions triggered by human error.
  • Explicit Replica Modification: When adjusting the number of replicas of a key workload, to avoid accidentally reducing it to 0, we require that the UPDATE or PATCH request that changes the replica count also explicitly adds a specific annotation carrying the expected new value as a double check. In the Webhook, we verify that the value in the .spec.replicas field of the changed workload object matches the value provided in the annotation, ensuring that any modification to the replica count of a key workload is intentional and explicit.
  • Explicit Resource Deletion: Before a key workload object can be deleted, its replica count must first be reduced to 0 via a modification operation. This constraint avoids human errors such as deleting a key workload object directly without confirmation, which may trigger further cascading deletions and cause business losses.
  • Operation Window Constraints: Some businesses have strict change windows for specification changes. For example, a business may only accept changes to configurations such as images and environment variables during off-peak hours, which reduces the potential problems and business interruption risks caused by specification changes. We define constraints such as the allowed change windows and changeable fields via CRDs, expose them to users, and verify them in the Webhook based on the user's configuration. This ensures that when a fault occurs it affects as few end users as possible, leaves relatively sufficient time for fault handling, minimizes potential losses, and reduces system risks.
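
The strategies above can be sketched as plain validation logic over the request part of an AdmissionReview. This is a minimal illustration, not ByteDance's webhook: the label and annotation names are hypothetical (the article does not name the real ones), the in_change_window helper assumes a simple daily time window rather than the CRD-based configuration described above, and a real deployment would also need an HTTPS server and a ValidatingWebhookConfiguration with suitable selectors.

```python
from datetime import time
from typing import Any, Dict, Optional, Tuple

# Hypothetical markers: the article does not name the real label or annotation.
PROTECTED_LABEL = "example.com/protected"
EXPECTED_REPLICAS_ANNOTATION = "example.com/expected-replicas"
ROOT_KINDS = {"Namespace", "CustomResourceDefinition"}


def _meta(obj: Optional[Dict[str, Any]], field: str) -> Dict[str, str]:
    """Return metadata.labels or metadata.annotations, tolerating gaps."""
    return ((obj or {}).get("metadata") or {}).get(field) or {}


def _replicas(obj: Optional[Dict[str, Any]]) -> Optional[int]:
    return ((obj or {}).get("spec") or {}).get("replicas")


def validate(request: Dict[str, Any]) -> Tuple[bool, str]:
    """Decide whether the `request` part of an AdmissionReview is allowed."""
    op = request.get("operation")
    kind = (request.get("kind") or {}).get("kind")
    old, new = request.get("oldObject"), request.get("object")

    # Only objects already marked as key objects are constrained.
    if _meta(old, "labels").get(PROTECTED_LABEL) != "true":
        return True, "object is not protected"

    if op == "DELETE":
        # Prevent cascading deletion of root objects.
        if kind in ROOT_KINDS:
            return False, f"deleting protected {kind} would cascade"
        # Explicit resource deletion: the workload must be scaled to 0 first.
        if _replicas(old) not in (None, 0):
            return False, "scale the workload to 0 replicas before deleting"
    elif op == "UPDATE" and _replicas(new) != _replicas(old):
        # Explicit replica modification: the annotation must repeat the new value.
        declared = _meta(new, "annotations").get(EXPECTED_REPLICAS_ANNOTATION)
        if declared != str(_replicas(new)):
            return False, "replica change not confirmed by annotation"
    return True, "ok"


def in_change_window(now: time, start: time, end: time) -> bool:
    """Operation window check; supports windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def review_response(uid: str, allowed: bool, message: str) -> Dict[str, Any]:
    """Wrap a decision in the AdmissionReview response shape."""
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {"uid": uid, "allowed": allowed,
                     "status": {"message": message}},
    }
```

The sketch reads the protection label from oldObject because, for DELETE requests, object is empty and only oldObject carries the existing resource. A production webhook would also need to guard against the label itself being removed.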

In addition, online production environments often encounter client anomalies, such as OOM or large amounts of cache penetration. These anomalies tend to trigger a flood of resource-intensive read requests, causing control plane anomalies or even avalanches. To protect against abnormal online traffic, we first restrict user behavior and prohibit some highly resource-intensive read-penetration patterns. Second, we deploy KubeGateway, a dedicated Layer 7 gateway customized for the traffic characteristics of kube-apiserver, in front of the control plane. It solves the problem of unbalanced load across kube-apiserver instances and provides complete governance of kube-apiserver requests, including request routing, traffic splitting, rate limiting, and degradation, which significantly improves the availability of Kubernetes clusters. Third, we have extended Kubernetes audit logs with traffic-related information and analyze them to build user profiles. In abnormal scenarios, we combine user profiles, traffic monitoring metrics, and KubeGateway's rate limiting capability to throttle clients that put excessive pressure on the control plane, reducing the avalanche risk as much as possible.
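
As a rough illustration of deriving user profiles from audit logs, the sketch below counts completed requests per client from JSON-lines audit events. It assumes only standard audit event fields (stage, verb, userAgent, user.username, objectRef.resource). The extra traffic information ByteDance attaches is not described in the article and is not modeled here, and the function names are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple

ProfileKey = Tuple[str, str, str, str]  # (username, userAgent, verb, resource)


def build_profile(audit_lines: Iterable[str]) -> Counter:
    """Count requests per (user, user agent, verb, resource) from audit events."""
    counts: Counter = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # A request can emit several stages; count it once, on completion.
        if event.get("stage") != "ResponseComplete":
            continue
        key = (
            (event.get("user") or {}).get("username", "unknown"),
            event.get("userAgent", "unknown"),
            event.get("verb", "unknown"),
            (event.get("objectRef") or {}).get("resource", "unknown"),
        )
        counts[key] += 1
    return counts


def heaviest_listers(profile: Counter, top: int = 3) -> List[Tuple[ProfileKey, int]]:
    """Return the clients issuing the most list requests."""
    lists = Counter({k: v for k, v in profile.items() if k[2] == "list"})
    return lists.most_common(top)
```

A profile like this can feed rate limiting rules: clients that dominate the list counts during an incident are the first candidates for throttling.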

Node Protection

In most scenarios, Pod deletion should be performed in two stages: First, the centralized Controller or user marks the Pod as deleted by sending a Delete request (i.e., adding a DeletionTimestamp), then kubelet is responsible for initiating a graceful shutdown of the business. After the business terminates and resources are released, kubelet will completely remove the Pod via the interface provided by APIServer. However, in production practice, we have encountered many problems that may cause kubelet to terminate business Pods unexpectedly due to anomalies, such as:

  • Due to configuration errors or code bugs, kubelet rejects running business Pods after restarting, causing business losses;
  • Due to data corruption or other anomalies in the control plane storage, kubelet finds that the locally running Pods do not match the Pods that should be running locally as provided by the control plane, causing unexpected business exits.

To address these issues, we have modified kubelet in several places, covering stages such as admission and housekeeping. The modification adds a precondition to kubelet's Pod deletion operation: when kubelet attempts to delete a key Pod, it first checks whether the Pod has been explicitly marked for deletion. If the Pod has not been marked for deletion, kubelet is not allowed to trigger the deletion. With this explicit deletion constraint, we have significantly reduced the node-level risks to running businesses caused by anomalies in various Kubernetes components.
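
The precondition can be expressed as a small predicate. The sketch below is illustrative only: kubelet is written in Go and the actual patch is not shown in the article, and the label name is a hypothetical stand-in for whatever marker identifies key Pods.

```python
from typing import Any, Dict

# Hypothetical label: the article does not name the actual marker.
EXPLICIT_DELETION_LABEL = "example.com/explicit-deletion-only"


def may_terminate(pod: Dict[str, Any]) -> bool:
    """Return True if a node agent may stop this Pod's containers."""
    meta = pod.get("metadata") or {}
    labels = meta.get("labels") or {}
    if labels.get(EXPLICIT_DELETION_LABEL) != "true":
        return True  # not a key Pod: default behaviour applies
    # Key Pods may only be stopped after the API has marked them for deletion.
    return bool(meta.get("deletionTimestamp"))
```

With a check like this in the admission and housekeeping paths, a key Pod that fails re-admission after a kubelet restart, or that disappears from the control plane's view, keeps running until someone deletes it through the API.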

Summary

In the production environment, we mainly identify and sort out key risks based on the interaction process between Kubernetes components, mark key objects via specific labels and annotations, and take corresponding measures to harden the system:

  • Data Protection: Mainly constrains operations and maintenance actions, consolidates data access entry points, and standardizes storage operations to reduce risks;
  • Control Plane Protection: Mainly extends the control plane with a customized ValidatingAdmissionWebhook that requires deliberately redundant operations and verifications when certain key objects are modified, reducing the risk of human error;
  • Node Protection: Mainly modifies kubelet to strictly require that key Pods be explicitly deleted, reducing systemic risks in extreme scenarios.

Application Cases

ByteDance has customized many functions on top of the native Kubernetes ecosystem to support its own scenarios. The overall pace of R&D, iteration, and delivery is very high, which poses greater challenges to cluster stability. Even with strict control over the delivery process, extreme risks in abnormal scenarios cannot be completely eliminated. Drawing on the fault cases and scenario requirements encountered in practice, the ByteDance Cloud Native team has built a relatively comprehensive defense system from multiple perspectives, such as meta-cluster, control plane, data plane, and business customization, effectively avoiding large-scale online accidents.

Data Protection: Meta-Cluster Cascading Deletion

ByteDance has a large number of internal clusters. To automate operations and cluster management, a meta-cluster is needed to describe the state of the business clusters, which means that anomalies in the meta-cluster itself can trigger larger-scale faults. In ByteDance's early days, clusters lacked protection capabilities. An SRE with excessive permissions accidentally deleted a CRD used to describe Node status in one region's meta-cluster during an operations task. With no defense system to intercept it, deleting the CRD triggered cascading deletion of all its CRs, which made the meta-cluster controller believe that almost all nodes needed to be taken offline and led to the Pods on them being physically shut down. In the end, this fault caused a single-region production cluster to mark more than 30,000 nodes for deletion within 30 minutes; 9,000 nodes had actually been deleted before the damage was stopped. The impact was huge and the window for manual intervention was very short. In this case, adopting the defense system provides protection at multiple points:

  • Pre-interception: Mark CRDs as critical to avoid cascading problems caused by accidental deletion of the whole resource type;
  • Node Offline Rate Limiting: Taking nodes offline at large scale is not a common operation. Control the rate of node removal and enforce a safety threshold so that even if abnormal cascading deletion occurs, the fault domain is contained as much as possible;
  • Data Backup and Recovery: When objects are physically deleted, they can be recovered quickly from backup data.
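
One way to picture the node offline rate limiting above is a guard that combines a sliding-window rate limit with a floor on remaining capacity. The class below is a hypothetical sketch. The parameter names and thresholds are assumptions, and the article does not describe how ByteDance actually implements this control.

```python
import time
from collections import deque
from typing import Deque, Optional


class NodeOfflineGuard:
    """Permit node removals only within a rate limit and above a capacity floor."""

    def __init__(self, total_nodes: int, max_per_window: int,
                 window_seconds: float, min_remaining_ratio: float) -> None:
        self.total_nodes = total_nodes
        self.remaining = total_nodes
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.min_remaining_ratio = min_remaining_ratio
        self._recent: Deque[float] = deque()  # timestamps of recent removals

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and record the removal if one more node may go offline."""
        now = time.monotonic() if now is None else now
        # Forget removals that have fallen out of the sliding window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_per_window:
            return False  # too many removals in the current window
        if self.remaining - 1 < self.min_remaining_ratio * self.total_nodes:
            return False  # would drop below the safety threshold
        self._recent.append(now)
        self.remaining -= 1
        return True
```

With such a guard in front of the node lifecycle controller, a runaway cascade is slowed to the configured rate and stops entirely at the capacity floor, which lengthens the window for manual intervention.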

Control Plane Protection: Abnormal Traffic Identification and Rate Limiting

Control plane anomalies usually originate from unreasonable client behavior and inaccurate estimates of server resources. Because the scenarios are complex, without fine-grained governance the server will eventually be overloaded for one reason or another. The typical symptom is a large number of List requests from clients together with APIServer OOM, which in turn triggers a full Relist by clients, creating a vicious cycle until the cluster avalanches. To handle extreme control plane anomalies, ByteDance internally uses a Layer 7 gateway combined with full-link automated traffic tracing to provide flexible and intelligent API request protection:

  • Normal Rate Limiting: Customize rate limiting rules per combination of client and resource object, based on analysis of normal traffic, to keep sudden bursts of requests from putting pressure on the server;
  • Disaster Recovery Circuit Breaking: When the cluster shows obvious anomalies or avalanches, break the circuit manually to stop the damage, then relax the limits gradually to bring the cluster back to normal.
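
The two modes above can be illustrated with a token bucket keyed by client and resource, plus a manual circuit breaker. This is a generic sketch of the technique, not KubeGateway's implementation. The class names, the default rate, and the circuit_open flag are all assumptions made for the example.

```python
import time
from typing import Dict, Optional, Tuple


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.updated = now

    def take(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RequestLimiter:
    """Rate-limit API requests per (client, resource) pair."""

    def __init__(self, rules: Dict[Tuple[str, str], Tuple[float, float]],
                 default: Tuple[float, float] = (50.0, 100.0)) -> None:
        self.rules = rules        # (client, resource) -> (rate, burst)
        self.default = default    # hypothetical fallback for unlisted pairs
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}
        self.circuit_open = False  # manual switch for disaster recovery

    def allow(self, client: str, resource: str,
              now: Optional[float] = None) -> bool:
        if self.circuit_open:
            return False  # circuit broken: reject everything until reset
        now = time.monotonic() if now is None else now
        key = (client, resource)
        bucket = self.buckets.get(key)
        if bucket is None:
            rate, burst = self.rules.get(key, self.default)
            bucket = self.buckets[key] = TokenBucket(rate, burst, now)
        return bucket.take(now)
```

Keying the buckets by (client, resource) means one misbehaving client listing Pods is throttled without affecting other clients or its own requests for other resources. Gradual recovery after a circuit break can be modeled by closing the circuit and raising the rates step by step.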

Node Protection: Large-Scale Eviction Triggered by Abnormal Version Upgrade

Compared with the control plane, the versions and configurations of the data plane are usually more complex and diverse, and they iterate more frequently, so improper component operations are more likely to trigger unexpected extreme risks. During a Kubelet version upgrade, an SRE applied an unexpected co-location resource configuration. After Kubelet restarted, it misidentified resource ownership, so a large number of running Pods failed admission and were deleted. Because the native Delete API is not guarded by PDB, this would ordinarily have caused a large loss of business capacity. However, thanks to the protection capabilities already deployed, no serious online problems resulted. In this case, adopting the defense system provides protection at both the single-node level and the central level:

  • Single-Node Interception: For core services that are already in the Running state, add the explicit-deletion label by default so that only explicit API-based deletion (setting deletionTimestamp) is allowed. This ensures that business instances keep running after an abnormal data plane release, leaving sufficient time for manual intervention;
  • Central Interception: Add verification to both the Delete and DeleteCollection APIs for core services to keep similar unexpected Pod deletions from affecting businesses.

Future Plans

ByteDance's protection practices will be gradually integrated into the Volcano Engine VKE product, providing more reliable stability guarantees for cloud services. In addition, we will continue to enhance cloud-native protection and address more scenarios that may pose stability risks to cloud services, including the following:

  • Control Plane Delete Pod API Protection: The built-in PDB protection mechanism only applies to the Evict Pod API, and its verification performance is poor. When there are a large number of PDB objects, the latency of the Evict Pod API degrades significantly and far exceeds that of Delete Pod. As a result, many components deliberately avoid Evict Pod and call Delete Pod directly, for example when the scheduler initiates preemption. Since the control plane has few built-in verifications for Delete Pod, using this interface directly can easily push the ratio of healthy business Pods below the expected level and affect normal business operations. To avoid this type of risk, we need to optimize the performance of Evict Pod on the one hand and, on the other hand, add stricter verification to the Delete Pod operation to ensure that the ratio of healthy running Pods does not fall below the expected level.
  • Converge Static Verification Strategies: Our current control plane protection relies mainly on the Validating Admission Webhook mechanism. On the one hand, this adds an external call to apiserver's request processing, increasing latency and the probability of errors; on the other hand, it also makes cluster operations more complex. Kubernetes 1.26 introduced a new admission mechanism (ValidatingAdmissionPolicy, alpha in that release) that supports using CEL (Common Expression Language) to perform static verification on requests. In the future, we will migrate some of the redundant-operation verifications in control plane protection to CEL to mitigate these issues.
  • Scenario-Customized Protection Strategies: Stateful businesses such as Redis and distributed training have many custom requirements for their orchestration models and operations solutions. The defense system therefore needs more refined strategies that match the specific extreme risks of these businesses, based on their characteristics (such as storage sharding, vertical resource adjustment, in-place restart, etc.).
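
As an illustration of the stricter Delete Pod verification planned above, a health-ratio check similar in spirit to PDB could look like the sketch below. The function name and the percentage-based threshold are assumptions made for the example. The article does not specify how the check will be implemented.

```python
def delete_allowed(healthy: int, desired: int, min_healthy_percent: int,
                   target_is_healthy: bool = True) -> bool:
    """Allow deleting one Pod only if enough healthy Pods would remain.

    healthy: Pods of the workload that are currently healthy.
    desired: replica count the workload is expected to run.
    min_healthy_percent: hypothetical threshold, 0-100.
    """
    if desired <= 0:
        return True  # nothing is expected to run, so nothing to protect
    # Integer ceiling of desired * percent / 100, avoiding float rounding.
    required = -(-desired * min_healthy_percent // 100)
    remaining = healthy - 1 if target_is_healthy else healthy
    return remaining >= required
```

Deleting a Pod that is already unhealthy does not reduce the healthy count, so the sketch lets such deletions through even when the workload is at its threshold.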

Conclusion

This article introduces the main system risks discovered while running Kubernetes in ByteDance's internal production environment and the protection measures proposed in response. Starting from the interaction process between Kubernetes components, we divided the system into three layers (data, control plane, and node) and illustrated common problems with specific examples, including human errors and control plane component version errors. For these common problems, we briefly introduced the defensive measures we have built, including but not limited to constraining component access permissions and deliberately adding redundant operations and related verifications. Through these defensive measures, we can reduce the risks that known problems pose to businesses and provide stable basic services for them.

In addition to the necessary defensive hardening measures, standardized change processes during daily cluster maintenance are also crucial. Controlling the cluster scale and conducting thorough gray-scale verification reduces the scope of impact when faults occur. In the production environment, risks and fault losses can only be minimized by combining the system's own defensive measures with standardized operations and maintenance.


This is a discussion topic separated from the original topic at https://juejin.cn/post/7359954143916736547