
GPU Management

gflow detects NVIDIA GPUs (via NVML) and allocates them to jobs by setting CUDA_VISIBLE_DEVICES.

Quick Start

bash
# Start the daemon (if not already running)
gflowd up

# See availability + current allocations
ginfo

# Submit a GPU job
gbatch --gpus 1 python train.py

# Track jobs and allocations
gqueue -s Running,Queued -f "JOBID,NAME,ST,NODES,NODELIST(REASON)"

Inspect GPUs

bash
ginfo

Example output:

PARTITION  GPUS  NODES  STATE      JOB(REASON)
gpu        1     1      idle
gpu        1     0      allocated  5 (train-resnet)
  • NODES shows the physical GPU indices.
  • If a GPU is busy but was not allocated by gflow, a reason is shown when one is available.
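
To map the NODES indices to physical devices, you can list GPUs by index with nvidia-smi (an ordinary NVML query, not a gflow command; its ordering typically matches the NVML indices gflow uses):

bash
nvidia-smi -L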

Non-gflow GPU usage:

  • If NVML reports running compute processes on a GPU, gflow treats it as unavailable (often shown as Unmanaged) and will not allocate it.
  • gflow does not preempt/kill non-gflow processes; jobs wait until the GPU becomes idle.
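
To see whether non-gflow compute processes are holding a GPU, you can query them directly (a standard nvidia-smi query, independent of gflow; the exact field names for your driver can be listed with nvidia-smi --help-query-compute-apps):

bash
# List compute processes and the VRAM they hold
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv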

If you need per-GPU restriction status (allowed vs restricted):

bash
gctl show-gpus

Requirements

  • NVIDIA GPU(s) + driver
  • NVML library available (libnvidia-ml.so)

Quick check:

bash
nvidia-smi
gflowd up
ginfo
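
If nvidia-smi works but gflow reports no GPUs, check that the NVML shared library is discoverable (assumes a Linux system with ldconfig; a general check, not a gflow command):

bash
ldconfig -p | grep libnvidia-ml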

On systems without GPUs, gflow still works; only GPU allocation is unavailable.

Request GPUs

bash
gbatch --gpus 1 python train.py
gbatch --gpus 2 python multi_gpu_train.py

When a job starts, gflow assigns physical GPU indices and exports them via CUDA_VISIBLE_DEVICES (which frameworks typically renumber starting from 0).
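
For example, if gflow allocated physical GPUs 2 and 3 to a job, the job would see something like the following (the indices and the use of PyTorch are illustrative):

bash
# Inside the job script:
echo "$CUDA_VISIBLE_DEVICES"
# -> 2,3
python -c "import torch; print(torch.cuda.device_count())"
# -> 2  (the framework renumbers: cuda:0 is physical GPU 2, cuda:1 is physical GPU 3)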

To see allocated GPU IDs:

bash
gqueue -s Running -f "JOBID,NAME,ST,NODES,NODELIST(REASON)"
gjob show <job_id>

Shared GPU Mode

Use shared mode when you want multiple jobs to co-locate on one physical GPU.

bash
gbatch --gpus 1 --shared --gpu-memory 20G python train.py
  • --shared jobs only share with other --shared jobs.
  • --shared requires a per-GPU VRAM limit via --gpu-memory (alias: --max-gpu-mem).
  • --memory (alias: --max-mem) still limits host RAM, not GPU VRAM.
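
For instance, two shared jobs whose VRAM limits both fit on one card may be placed on the same physical GPU (the script names and 10G caps below are illustrative):

bash
# Both jobs request 1 GPU in shared mode with a 10 GiB VRAM cap each;
# the scheduler may co-locate them on one physical GPU.
gbatch --gpus 1 --shared --gpu-memory 10G python train_a.py
gbatch --gpus 1 --shared --gpu-memory 10G python train_b.py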

GPU Visibility

bash
#!/bin/bash
# GFLOW --gpus 2

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
python train.py
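
The # GFLOW --gpus 2 line suggests that resource requests can be embedded in the script itself. Assuming gbatch accepts a script path (an assumption, not confirmed here), submitting it might look like:

bash
gbatch check_gpus.sh   # check_gpus.sh is a hypothetical name for the script above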

Restrict Which GPUs gflow Uses

Limit which physical GPUs the scheduler is allowed to allocate (affects new allocations only):

bash
gctl set-gpus 0,2
gctl show-gpus

# Or via daemon CLI flag (overrides config)
gflowd restart --gpus 0-3

See also: Configuration -> GPU Selection.

Choose GPU Allocation Strategy

When multiple GPUs are available for a job, you can choose how gflow selects them:

  • sequential (default): picks lower indices first.
  • random: randomizes GPU selection order.
Set the strategy in the daemon config:

toml
[daemon]
gpu_allocation_strategy = "sequential"
# gpu_allocation_strategy = "random"

Or override on daemon startup:

bash
gflowd up --gpu-allocation-strategy random
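
As a concrete illustration of the difference (the GPU count below is hypothetical):

bash
# With GPUs 0-3 idle and a job requesting one GPU:
#   sequential -> GPU 0 is chosen (lowest index first)
#   random     -> any of GPUs 0-3 may be chosen
gbatch --gpus 1 python train.py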

Troubleshooting

Job not getting GPU

bash
ginfo                                                    # any idle GPUs, or are they allocated/Unmanaged?
gqueue -j <job_id> -f "JOBID,ST,NODES,NODELIST(REASON)"  # what reason is the job waiting on?
gctl show-gpus                                           # is the GPU restricted from allocation?

Job sees wrong GPUs

bash
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
gqueue -f JOBID,NODELIST(REASON)

Out of memory

bash
nvidia-smi --query-gpu=memory.free,memory.used --format=csv   # per-GPU VRAM headroom and usage

If shared jobs fail with OOM, verify --gpu-memory is set and sized appropriately for each job.
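
If the limit is simply too small for the workload, resubmit with a larger cap (the 30G figure is illustrative):

bash
gbatch --gpus 1 --shared --gpu-memory 30G python train.py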
