
Kubernetes 1.36 Workload-Aware Scheduling: Gang Scheduling and Resource Optimization for AI/ML Workloads

Matthias Bruns · 6 min read
kubernetes scheduling ai-workloads resource-management

Kubernetes 1.36 introduces significant improvements to workload-aware scheduling that fundamentally change how AI/ML and batch workloads run in production clusters. The new architecture cleanly separates concerns between the Workload API and the PodGroup API, enabling true gang scheduling and sophisticated resource optimization for distributed training jobs.

After working with distributed ML workloads on Kubernetes for years, we’ve seen too many training jobs fail because pods get scheduled across resource-constrained nodes, or worse, partially scheduled and left hanging. Kubernetes 1.36’s workload-aware scheduling finally addresses these pain points with native support for gang scheduling and topology-aware algorithms designed specifically for high-performance distributed workloads.

The Evolution from Kubernetes 1.35 to 1.36

Kubernetes v1.35 introduced the first tranche of workload-aware scheduling improvements, making workloads a first-class citizen for kube-scheduler instead of relying on custom schedulers. However, v1.35 had architectural limitations that v1.36 addresses head-on.

The key change in v1.36 is architectural: it cleanly separates API concerns, with the Workload API acting as a static template and the new PodGroup API handling the runtime state. This separation enables more sophisticated scheduling decisions and better integration with existing Kubernetes controllers.

Kubernetes 1.36 also introduces a topology-aware and DRA-aware scheduling algorithm in kube-scheduler, designed specifically for high-performance distributed workloads like AI/ML training. The DRA (Dynamic Resource Allocation) integration is particularly important for GPU-intensive workloads that need specific hardware configurations.

Understanding Gang Scheduling in Kubernetes 1.36

Gang scheduling solves a critical problem in distributed workloads: ensuring that all pods in a workload group are scheduled together or not at all. Without gang scheduling, you might end up with partial deployments where some pods are running while others are stuck pending, effectively wasting resources and preventing the workload from making progress.

The all-or-nothing policy is at the core of gang scheduling. The minCount field defines the quorum: at least that many pods must be schedulable together for the group to be admitted. This prevents the common scenario where distributed training jobs get partially scheduled and hang indefinitely.
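To make the quorum idea concrete, here is a minimal sketch of all-or-nothing admission in plain Python. It is not the kube-scheduler implementation: the `Node` class, the `try_admit_gang` function, and the first-fit placement are our own illustrative assumptions. Only the `min_count` idea (mirroring the minCount field) comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node model: a name and a count of free GPUs."""
    name: str
    free_gpus: int


def try_admit_gang(pod_gpu_requests, nodes, min_count):
    """Return {pod_index: node_name} if at least min_count pods fit, else None."""
    # Work on a copy of the free capacity so a failed attempt has no side effects.
    free = {n.name: n.free_gpus for n in nodes}
    placement = {}
    for i, request in enumerate(pod_gpu_requests):
        # First-fit (an assumption): take the first node with enough free GPUs.
        for name, available in free.items():
            if available >= request:
                free[name] = available - request
                placement[i] = name
                break
    # All-or-nothing: below the quorum, no pod is placed at all.
    return placement if len(placement) >= min_count else None


nodes = [Node("node-a", 4), Node("node-b", 2)]
print(try_admit_gang([2, 2, 2], nodes, min_count=3))     # quorum met
print(try_admit_gang([2, 2, 2, 2], nodes, min_count=4))  # None: only three pods fit
```

The important property is the last line of the function: a group that cannot reach its quorum consumes no capacity, so it cannot block other groups while it waits.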

The benefits extend beyond admission control. Grouping related pods lets controllers, status reporting, future preemption behavior, and future workload-aware features reason about those pods as a unit, even if they do not need strict all-or-nothing admission.

Configuring Workload-Aware Scheduling for AI/ML Workloads

Note: The workload-aware scheduling features in Kubernetes 1.36 are in alpha status. You’ll need to enable feature gates and understand that APIs may change in future releases.

The new architecture introduces two key APIs that work together:

Workload API (Static Template)

The Workload API defines the static configuration for your workload group. This includes resource requirements, topology constraints, and scheduling policies that don’t change during the workload’s lifecycle.

PodGroup API (Runtime State)

The PodGroup API handles the runtime state with native Job controller integration. Keeping runtime state apart from the static template allows the scheduler to make more informed decisions about pod placement.
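The split between a static template and runtime state can be sketched with two Python dataclasses. This is a conceptual model only: the class names and fields (`WorkloadTemplate`, `PodGroupState`, `gpus_per_pod`, `scheduled_pods`) are hypothetical and do not reflect the actual alpha API schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkloadTemplate:
    """Static part (hypothetical fields): fixed for the workload's lifetime."""
    name: str
    min_count: int
    gpus_per_pod: int


@dataclass
class PodGroupState:
    """Runtime part (hypothetical fields): changes as pods get scheduled."""
    template: WorkloadTemplate
    scheduled_pods: set = field(default_factory=set)

    def mark_scheduled(self, pod_name):
        self.scheduled_pods.add(pod_name)

    @property
    def quorum_met(self):
        return len(self.scheduled_pods) >= self.template.min_count


template = WorkloadTemplate(name="training-job", min_count=2, gpus_per_pod=1)
group = PodGroupState(template)
group.mark_scheduled("worker-0")
print(group.quorum_met)  # False
group.mark_scheduled("worker-1")
print(group.quorum_met)  # True
```

Freezing the template means any attempt to change `min_count` after creation raises an error, while the group state stays free to evolve. That mirrors the reason for the split: one object describes intent, the other tracks what actually happened.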

Resource Optimization Strategies

For AI/ML workloads, resource optimization goes beyond simple CPU and memory allocation. Two areas matter most: cluster topology and device allocation.

Topology-Aware Scheduling

The new topology-aware scheduling algorithm understands the physical layout of your cluster and can make intelligent decisions about pod placement. This is crucial for distributed training where network topology directly impacts performance.

For GPU-intensive workloads, the topology factors that matter for placement include:

  • NUMA topology for optimal memory access patterns
  • GPU interconnect topology (NVLink, InfiniBand)
  • Network bandwidth between nodes
  • Storage locality for large datasets
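One way to picture topology-aware placement is to pick a single topology domain, such as a rack, that can hold the whole group. The sketch below assumes each node carries a rack label and a free GPU count; the `pick_domain` function and its tightest-fit rule are our own illustration, not the algorithm kube-scheduler uses.

```python
from collections import defaultdict


def pick_domain(nodes, pods_needed, gpus_per_pod):
    """Pick one rack that can host every pod of the group, or return None.

    nodes: list of (node_name, rack_label, free_gpus) tuples (hypothetical model).
    """
    slots = defaultdict(int)
    for _name, rack, free_gpus in nodes:
        # A node contributes as many pod slots as whole pods fit on it.
        slots[rack] += free_gpus // gpus_per_pod
    candidates = [(count, rack) for rack, count in slots.items() if count >= pods_needed]
    if not candidates:
        return None
    # Tightest fit (an assumption): keep larger racks free for larger groups.
    return min(candidates)[1]


nodes = [
    ("n1", "rack-1", 8),
    ("n2", "rack-1", 8),
    ("n3", "rack-2", 4),
    ("n4", "rack-2", 2),
]
print(pick_domain(nodes, pods_needed=3, gpus_per_pod=2))   # rack-2
print(pick_domain(nodes, pods_needed=4, gpus_per_pod=2))   # rack-1
print(pick_domain(nodes, pods_needed=10, gpus_per_pod=2))  # None
```

Keeping a group inside one domain trades schedulability for bandwidth: the group only starts when a single rack has room, but its pods never talk across racks.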

DRA Integration for GPU Workloads

The DRA-aware scheduling algorithm represents a major step forward for GPU resource management. Instead of treating GPUs as simple countable resources, the scheduler can now understand GPU capabilities, memory requirements, and interconnect requirements.

This enables more sophisticated scheduling decisions like:

  • Ensuring all pods in a training job get GPUs from the same generation
  • Placing pods to maximize GPU interconnect bandwidth
  • Avoiding GPU memory fragmentation across training steps
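The first point, same-generation GPUs for the whole job, can be sketched as attribute-based device selection. The device dictionaries and their keys (`name`, `generation`, `memory_gib`) are hypothetical stand-ins for device attributes; this is not the DRA API, only the selection idea behind it.

```python
from collections import defaultdict


def pick_homogeneous_gpus(devices, count, min_memory_gib):
    """Return (generation, device_names) with `count` GPUs of one generation, or None.

    devices: list of dicts with hypothetical keys name, generation, memory_gib.
    """
    by_generation = defaultdict(list)
    for device in devices:
        # Filter on capability first, then bucket by generation.
        if device["memory_gib"] >= min_memory_gib:
            by_generation[device["generation"]].append(device["name"])
    # Deterministic order (an assumption): try generations alphabetically.
    for generation in sorted(by_generation):
        names = by_generation[generation]
        if len(names) >= count:
            return generation, names[:count]
    return None


devices = [
    {"name": "gpu-0", "generation": "gen-a", "memory_gib": 80},
    {"name": "gpu-1", "generation": "gen-a", "memory_gib": 80},
    {"name": "gpu-2", "generation": "gen-b", "memory_gib": 40},
    {"name": "gpu-3", "generation": "gen-b", "memory_gib": 80},
    {"name": "gpu-4", "generation": "gen-a", "memory_gib": 40},
]
print(pick_homogeneous_gpus(devices, count=2, min_memory_gib=80))  # ('gen-a', ['gpu-0', 'gpu-1'])
print(pick_homogeneous_gpus(devices, count=3, min_memory_gib=80))  # None
```

A plain GPU count would have happily mixed generations in the second call. Selecting on attributes is what makes the refusal possible.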

Production Deployment Considerations

Cluster Configuration

Before deploying workload-aware scheduling in production, ensure your cluster is properly configured:

  1. Feature Gates: Enable the necessary alpha feature gates for workload-aware scheduling
  2. Scheduler Configuration: Configure the kube-scheduler to use the new scheduling algorithms
  3. Resource Discovery: Ensure proper resource discovery for GPUs and other specialized hardware
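For the feature-gate step, Kubernetes components accept a `--feature-gates` flag with comma-separated `Name=true|false` pairs. The small helper below renders that flag from a dictionary so the same gate set can be applied consistently across components. The gate names in the example are placeholders, not real gate names; check the release notes for the gates your version actually requires.

```python
def feature_gates_flag(gates):
    """Render a {gate_name: enabled} dict as a --feature-gates command-line flag."""
    pairs = ",".join(
        f"{name}={'true' if enabled else 'false'}"
        for name, enabled in sorted(gates.items())
    )
    return f"--feature-gates={pairs}"


# Placeholder gate names for illustration only.
gates = {"ExampleTopologyGate": True, "ExampleGangGate": True}
print(feature_gates_flag(gates))
# --feature-gates=ExampleGangGate=true,ExampleTopologyGate=true
```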

Monitoring and Observability

Workload-aware scheduling introduces new metrics and events that you should monitor:

  • PodGroup Status: Track the state of pod groups and admission decisions
  • Scheduling Latency: Monitor how long it takes to schedule workload groups
  • Resource Utilization: Track resource efficiency improvements from better scheduling
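Scheduling latency for a group is not the same as latency for a single pod: what matters is when the quorum was reached. A minimal sketch, assuming you can collect the group's creation time and each pod's bind time from your own event pipeline (the function name and inputs are hypothetical, not a built-in metric):

```python
from datetime import datetime


def group_scheduling_latency(created_at, pod_bound_times, min_count):
    """Seconds from group creation until the min_count-th pod was bound.

    Returns None if fewer than min_count pods were ever bound.
    """
    if len(pod_bound_times) < min_count:
        return None
    # The quorum is reached when the min_count-th earliest pod is bound.
    quorum_time = sorted(pod_bound_times)[min_count - 1]
    return (quorum_time - created_at).total_seconds()


created = datetime(2025, 1, 1, 12, 0, 0)
bound = [
    datetime(2025, 1, 1, 12, 0, 5),
    datetime(2025, 1, 1, 12, 0, 3),
    datetime(2025, 1, 1, 12, 0, 9),
]
print(group_scheduling_latency(created, bound, min_count=3))  # 9.0
print(group_scheduling_latency(created, bound, min_count=4))  # None
```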

Failure Handling

Gang scheduling changes how you need to think about failure handling:

  • Partial Failures: With gang scheduling, a partial failure can result in the entire workload group being rescheduled rather than a single pod
  • Resource Contention: Understand how the scheduler handles resource contention when multiple workload groups compete for the same resources
  • Preemption Behavior: The new preemption logic considers workload groups as units, not individual pods
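Treating groups as the unit of preemption can be illustrated with a simple victim-selection sketch. This is not the kube-scheduler's preemption logic: the tuple layout and the lowest-priority-first rule in `pick_victim_groups` are assumptions chosen to show the idea that a group is evicted whole or not at all.

```python
def pick_victim_groups(running_groups, gpus_needed, incoming_priority):
    """Choose whole groups to evict so an incoming group can fit, or return None.

    running_groups: list of (name, priority, gpus_held) tuples (hypothetical model).
    """
    victims, freed = [], 0
    # Consider the lowest-priority groups first.
    for name, priority, gpus_held in sorted(running_groups, key=lambda g: g[1]):
        if priority >= incoming_priority:
            break  # Never evict a group of equal or higher priority.
        victims.append(name)
        freed += gpus_held  # The whole group goes, never individual pods.
        if freed >= gpus_needed:
            return victims
    return None


groups = [("batch-a", 10, 4), ("batch-b", 5, 2), ("prod", 100, 8)]
print(pick_victim_groups(groups, gpus_needed=5, incoming_priority=50))   # ['batch-b', 'batch-a']
print(pick_victim_groups(groups, gpus_needed=10, incoming_priority=50))  # None
```

Evicting single pods from `batch-a` would leave a training job that can no longer make progress while still holding GPUs, which is exactly the waste gang scheduling is meant to avoid.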

Best Practices for AI/ML Workloads

Right-Sizing Workload Groups

Don’t make workload groups too large. While gang scheduling ensures all-or-nothing admission, larger groups are harder to schedule and more likely to fail admission. Find the right balance between coordination requirements and schedulability.

Resource Request Accuracy

With workload-aware scheduling, accurate resource requests become even more critical. The scheduler makes admission decisions based on the total resource requirements of the workload group, so underestimating resources can lead to poor performance, while overestimating reduces schedulability.

Topology Constraints

Use topology constraints judiciously. While they can significantly improve performance for distributed workloads, overly restrictive constraints can make workloads unschedulable in smaller clusters.

Migration from Custom Schedulers

Many organizations currently use custom schedulers like Volcano or YuniKorn for gang scheduling. Kubernetes 1.36’s native support provides a migration path, but consider:

Feature Parity

Evaluate whether the native workload-aware scheduling provides all the features your current custom scheduler offers. Some advanced features may still require custom schedulers.

Gradual Migration

Plan a gradual migration strategy. You can run both scheduling systems in parallel during the transition period, scheduling different workload types with different schedulers.
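Running two schedulers side by side relies on the pod's `spec.schedulerName` field, which tells Kubernetes which scheduler should handle the pod. The sketch below routes pods by workload type; the `assign_scheduler` helper, the workload-type labels, and the routing table are hypothetical, and the name `volcano` is an assumption about how your custom scheduler is deployed.

```python
def assign_scheduler(pod_manifest, workload_type, routing, default="default-scheduler"):
    """Return a copy of the pod manifest with spec.schedulerName set by workload type."""
    spec = dict(pod_manifest.get("spec", {}))
    # Workload types without a routing entry fall back to the default scheduler.
    spec["schedulerName"] = routing.get(workload_type, default)
    return {**pod_manifest, "spec": spec}


# Hypothetical routing: keep training on the custom scheduler during the transition.
routing = {"training": "volcano"}
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer-0"},
    "spec": {"containers": []},
}
print(assign_scheduler(pod, "training", routing)["spec"]["schedulerName"])   # volcano
print(assign_scheduler(pod, "inference", routing)["spec"]["schedulerName"])  # default-scheduler
```

Migrating a workload type then becomes a one-line change to the routing table, which also makes rollback straightforward if validation turns up a regression.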

Monitoring and Validation

Implement comprehensive monitoring to validate that the native scheduler performs as well as your custom solution for your specific workloads.

Future Outlook

The workload-aware scheduling improvements in Kubernetes 1.36 represent just the beginning. The clean API separation between Workload and PodGroup opens possibilities for future enhancements like:

  • More sophisticated preemption policies
  • Advanced resource sharing strategies
  • Better integration with cluster autoscaling
  • Enhanced support for multi-tenant workload scheduling

Conclusion

Kubernetes 1.36’s workload-aware scheduling represents a significant step forward for AI/ML workloads in production environments. The combination of gang scheduling, topology-aware algorithms, and DRA integration addresses long-standing pain points in distributed workload management.

While these features are still in alpha, they provide a clear path toward native support for complex workload scheduling requirements. Organizations running AI/ML workloads should start evaluating these capabilities and planning migration strategies from custom schedulers.

The architectural improvements in v1.36 create a solid foundation for future enhancements, making this release a turning point for workload-aware scheduling in Kubernetes. For production AI/ML workloads, the investment in understanding and adopting these new capabilities will pay dividends in improved resource utilization, reduced job failures, and simplified cluster management.
