Gang Scheduling on AKS: Volcano vs Kueue vs KAI Scheduler

Introduction

In my previous post, I deployed NVSentinel for GPU fault detection on AKS. That post assumed workloads were already scheduled — but how they get scheduled matters just as much, especially for multi-node GPU training.

The default Kubernetes scheduler has a fundamental problem for distributed training: it places pods independently. If you submit a 4-node training job, the scheduler might place 3 pods immediately and leave the 4th pending for hours — burning GPU-hours on the 3 idle workers. In SLURM, this never happens: srun --nodes=4 waits until all 4 nodes are available, then starts everything simultaneously. This is gang scheduling.
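The all-or-nothing difference can be sketched as a toy model — this is illustrative Python, not real scheduler code, and the "cluster" is just a count of free nodes:

```python
# Toy model: each pod needs exactly one free node.

def schedule_per_pod(free_nodes: int, pods_needed: int) -> dict:
    # Default kube-scheduler behavior: place whatever fits now,
    # leave the rest Pending indefinitely.
    placed = min(free_nodes, pods_needed)
    return {"running": placed, "pending": pods_needed - placed}

def schedule_gang(free_nodes: int, pods_needed: int) -> dict:
    # Gang behavior (srun --nodes=N): all-or-nothing.
    if free_nodes >= pods_needed:
        return {"running": pods_needed, "pending": 0}
    return {"running": 0, "pending": pods_needed}

# 3 free nodes, 4-pod training job:
print(schedule_per_pod(3, 4))  # {'running': 3, 'pending': 1} — 3 workers burn GPU-hours
print(schedule_gang(3, 4))     # {'running': 0, 'pending': 4} — nothing starts until all fit
```

The per-pod variant is what leaves three H100 nodes idling while the fourth replica waits; the gang variant is what all three schedulers below add to Kubernetes.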

Kubernetes doesn’t have gang scheduling built in. Three open-source projects fill that gap: Volcano (CNCF, started at Huawei), Kueue (Kubernetes SIG, Google), and KAI Scheduler (NVIDIA, ex-Run:ai). Each takes a different architectural approach.

In this post, I install all three on the same AKS cluster with H100 GPU nodes and submit the same job through each. The goal isn’t a benchmark — the scheduling overhead is nearly identical (~7–9 seconds). The goal is to compare the user experience: how you submit jobs, how you check status, how the queue systems work, and what tradeoffs each scheduler makes.

Test Environment

| Component | Detail |
| --- | --- |
| Cluster | Azure Kubernetes Service (AKS), Kubernetes v1.33.7 |
| GPU Node Pool | 2× Standard_ND96isr_H100_v5 (8× H100 per node, 16 GPUs total) |
| System Node Pool | 2× Standard_D4ads_v5 |
| GPU Operator | NVIDIA GPU Operator (driver pre-installed on AKS) |
| Volcano | v1.14.1 (Helm chart) |
| Kueue | v0.16.4 (manifest install) |
| KAI Scheduler | v0.13.4 (Helm chart) |

All three schedulers can coexist on the same cluster. Each uses a different schedulerName, so there’s no conflict — a pod specifies which scheduler should place it.
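To see which scheduler owns each pod, you can read schedulerName straight off the pod spec — plain kubectl, no scheduler-specific tooling:

```shell
# Show every pod's assigned scheduler; empty or "default-scheduler"
# means the stock kube-scheduler is responsible for it
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,POD:.metadata.name,SCHEDULER:.spec.schedulerName
```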

How Each Scheduler Works

Before looking at the test results, it helps to understand the architectural difference:

VOLCANO:
  Pod created → volcano scheduler picks it up → places it
  (replaces kube-scheduler for those pods)
  Gang: PodGroup CRD tracks which pods must start together

KUEUE:
  Job created with suspend: true → Kueue holds it in queue →
  when quota allows, Kueue unsuspends → default kube-scheduler places pods
  (works WITH the default scheduler, not instead of it)
  Gang: all pods start when job is unsuspended

KAI:
  Pod created → kai-scheduler picks it up → places it
  (replaces kube-scheduler for those pods)
  Gang: pod-grouper auto-creates PodGroups

The key distinction: Kueue is a queue manager, not a scheduler. It decides when a job should start, but the default Kubernetes scheduler decides where pods go. Volcano and KAI replace the scheduler entirely — they control both when and where.

The Test Job

A simple 2-pod job, each requesting 1 GPU, running nvidia-smi and sleeping 30 seconds. Identical workload, three different submission methods.

Volcano

Volcano uses the standard Kubernetes batch/v1 Job with a pod annotation to assign the queue:

apiVersion: batch/v1
kind: Job
metadata:
  name: volcano-test
spec:
  parallelism: 2
  completions: 2
  completionMode: Indexed
  template:
    metadata:
      annotations:
        scheduling.volcano.sh/queue-name: "default"
    spec:
      schedulerName: volcano      # ← Volcano's scheduler
      containers:
        - name: gpu
          image: nvidia/cuda:12.6.0-base-ubuntu22.04
          command: ["sh", "-c", "nvidia-smi -L && sleep 30"]
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      restartPolicy: Never
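Under the hood, Volcano tracks gang membership with a PodGroup CRD. One is created automatically for annotated Jobs, but you can also declare the gang size explicitly — a sketch, assuming the scheduling.volcano.sh/v1beta1 API and a hypothetical group name:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: volcano-test-group
spec:
  minMember: 2        # gang size — no pod starts until 2 can be placed
  queue: default
```

Pods opt into the group via the scheduling.volcano.sh/group-name annotation on the pod template; with only the queue annotation (as above), Volcano derives the group from the Job itself.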

Kueue

Kueue uses a standard Job with a label and suspend: true:

apiVersion: batch/v1
kind: Job
metadata:
  name: kueue-test
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue  # ← Kueue queue assignment
spec:
  parallelism: 2
  completions: 2
  completionMode: Indexed
  suspend: true                            # ← Kueue will unsuspend when ready
  template:
    spec:
      # No schedulerName — uses default kube-scheduler
      containers:
        - name: gpu
          image: nvidia/cuda:12.6.0-base-ubuntu22.04
          command: ["sh", "-c", "nvidia-smi -L && sleep 30"]
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      restartPolicy: Never
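Once submitted, Kueue mirrors the Job as a Workload object — that is where queue position and admission status live:

```shell
# The Workload shows which queue the job sits in and whether it's admitted
kubectl get workloads -n default

# Describe explains *why* a workload is still queued (quota exhausted, etc.)
kubectl describe workload -n default <workload-name>
```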

KAI Scheduler

KAI uses a standard Job with a queue label and schedulerName:

apiVersion: batch/v1
kind: Job
metadata:
  name: kai-test
  labels:
    kai.scheduler/queue: team-training     # ← KAI queue assignment
spec:
  parallelism: 2
  completions: 2
  completionMode: Indexed
  template:
    metadata:
      labels:
        kai.scheduler/queue: team-training # ← must also be on pod template
    spec:
      schedulerName: kai-scheduler         # ← KAI's scheduler
      containers:
        - name: gpu
          image: nvidia/cuda:12.6.0-base-ubuntu22.04
          command: ["sh", "-c", "nvidia-smi -L && sleep 30"]
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      restartPolicy: Never
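KAI's pod-grouper creates the PodGroup for you; you can watch it appear after submitting. The CRD lives under the scheduling.run.ai API group, though the exact group/version may vary across KAI releases:

```shell
# PodGroups auto-created by the pod-grouper for gang tracking
kubectl get podgroups.scheduling.run.ai -n default
```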

Results

Each scheduler was installed in isolation — install, test, uninstall, pause 10 seconds, repeat — to ensure no interference. Same job, same cluster, clean state each time.

| | Volcano | Kueue | KAI |
| --- | --- | --- | --- |
| Schedule time | ~9s | ~7s | ~8s |
| Total time | ~37s | ~36s | ~35s |
| Worker 0 node | vmss000001 | vmss000001 | vmss000000 |
| Worker 1 node | vmss000000 | vmss000001 | vmss000000 |
| Placement strategy | Spread | Bin-packed | Bin-packed |

Scheduling overhead is nearly identical — 7–9 seconds from kubectl apply to pods running. The 30-second sleep dominates the total time.

The placement difference is the real finding:

  • Volcano spread the pods across both GPU nodes (one per node)
  • Kueue placed both pods on the same node (vmss000001)
  • KAI placed both pods on the same node (vmss000000)

This is consistent and reproducible — not an artifact of running schedulers side-by-side. Each scheduler was the only scheduler on the cluster during its test.

For distributed training with FSDP/NCCL, you typically want pods spread across nodes (to use all available GPUs and InfiniBand links). For independent inference jobs, bin-packing is better (keeps one node free for other work). Both Kueue and KAI support spread scheduling via topology-aware features, but their defaults favor bin-packing. Volcano spreads by default.
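Because Kueue delegates placement to the default scheduler, one way to get spread behavior from it is the standard Kubernetes topologySpreadConstraints field on the pod template — a sketch, relying on the job-name label that the batch/v1 Job controller adds automatically:

```yaml
# Added under the Job's pod template spec — allows at most a 1-pod skew per node
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        job-name: kueue-test   # set by the Job controller on every pod
```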

What About Real Training Jobs?

The 1-GPU-per-pod test revealed default placement preferences. But in practice, LLM training requests all 8 GPUs per node — and at that point, the scheduler must spread because two 8-GPU pods can’t fit on the same 8-GPU node.

To confirm, I ran Qwen2.5-7B fine-tuning with 8 GPUs per worker through Kueue:

resources:
  limits:
    nvidia.com/gpu: "8"   # all GPUs on the node
Pod placement:
qwen-finetune-kueue-0   aks-gpupool-09344442-vmss000001   8 GPUs
qwen-finetune-kueue-1   aks-gpupool-09344442-vmss000000   8 GPUs

Both workers land on separate nodes — the scheduler has no choice. Training completed in ~57 seconds per worker (5 steps, Qwen2.5-7B, bf16, wikitext-2):

{'loss': 0.7141, 'grad_norm': 46.75, 'learning_rate': 5e-05, 'epoch': 0.31}
{'loss': 0.3203, 'grad_norm': 28.5, 'learning_rate': 4e-05, 'epoch': 0.62}
{'loss': 0.0184, 'grad_norm': 1.546875, 'learning_rate': 3e-05, 'epoch': 0.92}
{'train_runtime': 56.4s, 'train_samples_per_second': 2.836}
[Kueue] Rank 0 finished training in 56.6s

The bin-pack vs spread distinction only matters for partial-GPU jobs (inference, dev notebooks, small experiments). For full-node training — the primary use case for H100 clusters — all three schedulers behave identically.

Queue Setup Comparison

Before submitting jobs, each scheduler needs queue infrastructure. The setup complexity varies:

Volcano — Minimal

Volcano creates a default queue automatically during Helm install. No additional setup needed for basic use.

# Already exists after helm install
kubectl get queues.scheduling.volcano.sh
# NAME      PARENT
# default   root
# root
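When the single default queue isn't enough, each additional Volcano queue is one small CRD. A sketch of a weighted team queue (scheduling.volcano.sh/v1beta1; the name and limits are hypothetical):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
spec:
  weight: 2                 # relative share under contention
  capability:
    nvidia.com/gpu: 8       # hard ceiling for this queue
```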

Kueue — Most Verbose

Kueue requires three CRDs: a ResourceFlavor (what kind of hardware), a ClusterQueue (capacity limits), and a LocalQueue (namespace-scoped entry point):

# 1. What hardware flavors exist
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-h100
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"

# 2. Cluster-wide capacity
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-h100
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16

# 3. Namespace entry point
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue

More setup, but the separation provides fine-grained multi-tenant control.
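After applying the three objects, a quick sanity check against the standard Kueue resource names:

```shell
# Cluster-scoped pieces: flavors and capacity
kubectl get resourceflavors,clusterqueues

# Namespace entry point jobs actually reference
kubectl get localqueues -n default
```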

KAI — Hierarchical

KAI uses a two-level queue hierarchy (parent + child):

apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-gpu
spec:
  resources:
    gpu:
      quota: 16

---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-training
spec:
  parentQueue: department-gpu
  resources:
    gpu:
      quota: 16

Command Cheat Sheet

| Operation | Volcano | Kueue | KAI |
| --- | --- | --- | --- |
| Submit | kubectl apply -f job.yaml | kubectl apply -f job.yaml | kubectl apply -f job.yaml |
| Queue assignment | Pod annotation: scheduling.volcano.sh/queue-name | Job label: kueue.x-k8s.io/queue-name | Pod label: kai.scheduler/queue |
| Scheduler | schedulerName: volcano | (default scheduler) | schedulerName: kai-scheduler |
| Gang mechanism | minAvailable or auto PodGroup | suspend: true → unsuspend | Auto PodGroup via pod-grouper |
| Job status | kubectl get vcjob or kubectl get job | kubectl get workloads | kubectl get job |
| Queue status | kubectl get queues.scheduling.volcano.sh | kubectl get clusterqueues | kubectl get queues.scheduling.run.ai |
| Pod logs | kubectl logs -l job-name=&lt;name&gt; | kubectl logs -l job-name=&lt;name&gt; | kubectl logs -l app=&lt;name&gt; |
| Cancel | kubectl delete job &lt;name&gt; | kubectl delete job &lt;name&gt; | kubectl delete job &lt;name&gt; |

SLURM Equivalents

For those coming from SLURM (like me), here’s the mental mapping:

| SLURM | Volcano | Kueue | KAI |
| --- | --- | --- | --- |
| sbatch | kubectl apply -f job.yaml | same | same |
| squeue | kubectl get vcjob | kubectl get workloads | kubectl get pods -l app=&lt;name&gt; |
| scancel | kubectl delete job | same | same |
| scontrol show job | kubectl describe vcjob | kubectl describe workload | kubectl describe job |
| Partition | Queue (scheduling.volcano.sh) | ClusterQueue + LocalQueue | Queue (scheduling.run.ai) |
| QOS / Priority | PriorityClass | WorkloadPriorityClass | priorityClassName label |
| Fair-share | DRF plugin | Cohort-based fair sharing | Time-based fairshare |
| --nodes=N (gang) | minAvailable: N | parallelism: N + suspend: true | parallelism: N (auto pod group) |

The biggest UX gap vs. SLURM: there’s no single squeue equivalent that shows all pending jobs across all schedulers. Each scheduler has its own status command.
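A rough stand-in for a cluster-wide squeue is to list Pending pods together with the scheduler responsible for each (plain kubectl). Note the caveat in the second command: Kueue-suspended Jobs have no pods yet, so they need a separate query:

```shell
# Pending pods across all schedulers, with the scheduler that owns each
kubectl get pods -A --field-selector=status.phase=Pending \
  -o custom-columns=NAMESPACE:.metadata.namespace,POD:.metadata.name,SCHEDULER:.spec.schedulerName

# Kueue-suspended Jobs never create pods, so query Kueue's queue directly
kubectl get workloads -A
```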

Which One Should You Use?

Volcano if you’re coming from HPC/SLURM and want the closest mental model. It has the widest ecosystem (Spark, MPI, PyTorch, Kubeflow), 6 years of production use, and CNCF backing. The VolcanoJob CRD feels most like sbatch. Azure ML on AKS uses Volcano under the hood.

Kueue if you want minimal disruption to your existing K8s setup. It doesn’t replace the scheduler — it just adds a queueing layer. This means you keep all existing scheduler plugins, topology rules, and pod affinity/anti-affinity. Best for teams that already have a mature K8s platform and want to add batch job management.

KAI if you’re running NVIDIA GPUs and want GPU-specific features: fractional GPU sharing, MIG-aware scheduling, NVLink topology-aware placement, and integration with the NVIDIA ecosystem (NVSentinel, GPU Operator, Run:ai). Youngest of the three but backed by NVIDIA’s production experience.

All three are free, open-source, and actively maintained. You can even run all three on the same cluster (as I did) and let different teams choose their preferred scheduler.

Gotcha: Volcano schedulerName

One issue I hit: Volcano’s Helm chart registers the scheduler as volcano, not volcano-scheduler. If you use the wrong name, pods stay Pending forever with no error message — the Volcano scheduler simply never sees them.

# ✗ Wrong — pods will be Pending forever
schedulerName: volcano-scheduler

# ✓ Correct
schedulerName: volcano

Also, the annotation-based approach (standard K8s Jobs with scheduling.volcano.sh/queue-name on the pod template) is more reliable than the VolcanoJob CRD for gang scheduling on newer Kubernetes versions.

Reproducing These Results

The cluster setup uses Terraform (AKS with Standard_ND96isr_H100_v5 GPU node pool) and Helm for each scheduler. The key install commands:

# Volcano
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

# Kueue
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.4/manifests.yaml

# KAI
helm install kai-scheduler \
  oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler \
  -n kai-scheduler --create-namespace --version v0.13.4

The job manifests shown in the Test Job section are complete and copy-pasteable — adjust the tolerations and GPU resource requests for your cluster.


This is a personal blog. Opinions and recommendations are my own, not Microsoft’s.
