GPU Management
This guide covers how gflow manages GPU resources, from detection to allocation and monitoring.
Overview
gflow provides automatic GPU detection, allocation, and management for NVIDIA GPUs through the NVML library. It ensures efficient GPU utilization across multiple jobs while preventing resource conflicts.
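The rest of this guide covers each step in detail. At a high level, a typical session using only the commands documented below looks like this:

```bash
# Bring up the scheduler daemon; GPUs are detected via NVML at startup
gflowd up

# Inspect the detected GPUs and their allocation status
ginfo

# Submit a job that needs one GPU; gflow picks the device and sets CUDA_VISIBLE_DEVICES
gbatch --gpus 1 python train.py
```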
GPU Detection
Checking Available GPUs
View system GPU information:
$ ginfo
Example output:
Scheduler Status: Running
Total GPUs: 2
Available GPUs: 1
GPU 0: NVIDIA GeForce RTX 3090
UUID: GPU-xxxxx...
Status: In use by job 5
GPU 1: NVIDIA GeForce RTX 3090
UUID: GPU-yyyyy...
Status: Available
Information displayed:
- Total number of GPUs in the system
- Number of currently available (unused) GPUs
- GPU model and UUID for each device
- Current allocation status (available or in use by which job)
- Enhanced display: Shows GPU allocations organized by job, making it easy to see which jobs are using which GPUs
Requirements
System requirements:
- NVIDIA GPU(s)
- NVIDIA drivers installed
- NVML library available (libnvidia-ml.so)
Verify GPU setup:
# Check NVIDIA driver
nvidia-smi
# Check NVML library
ldconfig -p | grep libnvidia-ml
# Test GPU detection with gflow
gflowd up
ginfo
No GPU Systems
gflow works perfectly fine on systems without GPUs:
- GPU detection fails gracefully
- All features work except GPU allocation
- Jobs can still be submitted without the --gpus flag
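For example, on a GPU-less machine a plain submission is all that is needed:

```bash
# No --gpus flag, so the job never waits on GPU availability
gbatch python cpu_task.py
```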
GPU Allocation
Requesting GPUs
Request GPUs when submitting jobs:
# Request 1 GPU
gbatch --gpus 1 python train.py
# Request 2 GPUs
gbatch --gpus 2 python multi_gpu_train.py
# Request 4 GPUs
gbatch --gpus 4 python distributed_train.py
Automatic GPU Assignment
When a job requests GPUs:
- Scheduler checks for available GPUs
- Assigns specific GPU IDs to the job
- Sets the CUDA_VISIBLE_DEVICES environment variable
- Job sees only its allocated GPUs (numbered 0, 1, 2, ...)
Example:
# Submit job requesting 2 GPUs
$ gbatch --gpus 2 nvidia-smi
# Check allocation
$ gqueue -f JOBID,NAME,NODES,NODELIST
JOBID NAME NODES NODELIST(REASON)
42 brave-river-1234 2 1,2
# Inside the job, CUDA_VISIBLE_DEVICES=1,2
# But CUDA will renumber them as 0,1 for the application
GPU Visibility
gflow uses CUDA_VISIBLE_DEVICES to control GPU access:
# In your job (Python example)
import os
import torch
# gflow sets this automatically
print(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES')}")
# CUDA sees only allocated GPUs
print(f"Visible GPUs to CUDA: {torch.cuda.device_count()}")
# Use GPUs normally (indexed from 0)
device = torch.device('cuda:0')  # First allocated GPU
Bash example:
#!/bin/bash
# GFLOW --gpus 2
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi --query-gpu=index,name,memory.free --format=csv
python train.py
GPU Scheduling
Job Queue with GPU Requests
Jobs wait for GPUs when none are available:
# System has 2 GPUs
# Job 1: Uses 2 GPUs
$ gbatch --gpus 2 python long_train.py
Submitted batch job 1
# Job 2: Requests 1 GPU (must wait)
$ gbatch --gpus 1 python train.py
Submitted batch job 2
$ gqueue
JOBID NAME ST NODES NODELIST(REASON)
1 job-1 R 2 0,1
2 job-2 PD 1 (Resources)
Job 2 waits until Job 1 releases at least 1 GPU.
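Once Job 1 finishes and its GPUs are released, the scheduler starts Job 2 automatically. Re-checking the queue at that point would show something like this (illustrative output):

```bash
$ gqueue
JOBID NAME  ST NODES NODELIST(REASON)
2     job-2 R  1     0
```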
Priority and GPU Allocation
Higher priority jobs get GPUs first:
# Low priority job
gbatch --priority 5 --gpus 1 python task1.py
# High priority job
gbatch --priority 100 --gpus 1 python urgent_task.py
When GPUs become available:
- Scheduler selects highest priority queued job
- Checks if enough GPUs are free
- Allocates GPUs and starts the job
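For illustration (job IDs, names, and output below are representative, not literal), suppose both jobs are queued while every GPU is busy; when a single GPU frees up, the priority-100 job is the one that starts:

```bash
$ gbatch --priority 5 --gpus 1 python task1.py
Submitted batch job 7
$ gbatch --priority 100 --gpus 1 python urgent_task.py
Submitted batch job 8

# After one GPU is released, job 8 runs while job 7 keeps waiting
$ gqueue -s Running,Queued -f JOBID,NAME,ST,NODELIST
JOBID NAME  ST NODELIST(REASON)
8     job-8 R  1
7     job-7 PD (Resources)
```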
Partial GPU Availability
If a job requests more GPUs than currently available, it waits:
# System has 4 GPUs, 3 in use
# This waits for 4 GPUs
gbatch --gpus 4 python distributed_train.py
$ gqueue
JOBID NAME ST NODES NODELIST(REASON)
5 job-5 PD 4 (Resources: Need 4 GPUs, only 1 available)
Monitoring GPU Usage
Check Current GPU Allocation
View GPU allocation for running jobs:
$ gqueue -s Running -f JOBID,NAME,NODES,NODELIST
Example output (when jobs are running):
JOBID NAME NODES NODELIST(REASON)
1 train-resnet 1 0
2 train-vit 1 1
3 train-bert 2 2,3
The NODES column shows how many GPUs each job requested, and NODELIST shows the specific GPU IDs allocated.
System-wide GPU Status
# View system info
$ ginfo
# Use nvidia-smi for real-time monitoring
watch -n 1 nvidia-smi
Per-job GPU Usage
# Submit job with GPU monitoring
cat > monitor_gpu.sh << 'EOF'
#!/bin/bash
# GFLOW --gpus 1
echo "=== GPU Allocation ==="
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
echo "=== GPU Details ==="
nvidia-smi --query-gpu=index,name,memory.total,memory.free,utilization.gpu \
--format=csv
echo "=== Training ==="
python train.py
EOF
chmod +x monitor_gpu.sh
gbatch monitor_gpu.sh
Check the log:
cat ~/.local/share/gflow/logs/<job_id>.log
Multi-GPU Training
Data Parallel Training (PyTorch)
# train.py
import torch
import torch.nn as nn
# gflow sets CUDA_VISIBLE_DEVICES automatically
device_count = torch.cuda.device_count()
print(f"Using {device_count} GPUs")
model = MyModel()
if device_count > 1:
    model = nn.DataParallel(model)
model = model.cuda()
# Train normally
train(model)
Submit with multiple GPUs:
gbatch --gpus 2 python train.py
Distributed Training (PyTorch)
# distributed_train.py
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # One process per allocated GPU; gflow limits visibility via CUDA_VISIBLE_DEVICES
    dist.init_process_group(backend='nccl', rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # Training code
    train(local_rank)
    dist.destroy_process_group()

def main():
    # gflow allocates GPUs via CUDA_VISIBLE_DEVICES
    world_size = torch.cuda.device_count()
    # Single-node rendezvous settings for the process group
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    # Launch one worker process per allocated GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

if __name__ == '__main__':
    main()
Submit:
gbatch --gpus 4 python distributed_train.py
TensorFlow Multi-GPU
# tf_train.py
import tensorflow as tf
# Let TensorFlow see all allocated GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"Available GPUs: {len(gpus)}")
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = create_model()
    model.compile(...)
model.fit(...)
Submit:
gbatch --gpus 2 python tf_train.py
Advanced GPU Management
GPU Memory Considerations
Even if GPUs are "available", they might have insufficient memory:
# Check GPU memory before submitting large jobs
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
# Example: Job needs 20GB per GPU
gbatch --gpus 1 python memory_intensive_train.py
Note: gflow tracks GPU allocation, not memory usage. Plan accordingly.
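Since the scheduler does not check memory for you, one lightweight pattern is to gate the submission on free memory yourself. A minimal sketch, assuming the job needs about 20 GB (20480 MiB) on a single GPU:

```bash
# Largest amount of free memory on any single GPU, in MiB
MAX_FREE=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | sort -n | tail -1)

if [ "$MAX_FREE" -ge 20480 ]; then
    gbatch --gpus 1 python memory_intensive_train.py
else
    echo "Only ${MAX_FREE} MiB free on the best GPU; try again later"
fi

# Caveat: gflow may allocate a different GPU than the one measured above
```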
Exclusive GPU Access
Each job gets exclusive access to its allocated GPUs:
- No other gflow job can use them
- Other processes (outside gflow) can still access them
- Use CUDA_VISIBLE_DEVICES to ensure isolation
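Since processes started outside gflow are not blocked, it can be worth checking whether anything else is already using the GPUs before relying on exclusivity:

```bash
# Any compute processes listed here are using the GPUs right now,
# whether or not they were started by gflow
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```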
Mixed GPU/CPU Jobs
Run CPU and GPU jobs simultaneously:
# CPU-only job
gbatch python cpu_task.py
# GPU job
gbatch --gpus 1 python gpu_task.py
CPU jobs don't consume GPU slots and can run in parallel with GPU jobs.
GPU Job Patterns
Sequential GPU Pipeline
Release GPUs between stages:
# Stage 1: Preprocessing (no GPU)
ID1=$(gbatch --time 30 python preprocess.py | grep -oP '\d+')
# Stage 2: Training (uses GPU)
ID2=$(gbatch --depends-on $ID1 --gpus 1 --time 4:00:00 \
python train.py | grep -oP '\d+')
# Stage 3: Evaluation (no GPU)
gbatch --depends-on $ID2 --time 10 python evaluate.py
Benefit: GPU is free during preprocessing and evaluation.
Parallel Multi-GPU Experiments
Run experiments in parallel on different GPUs:
# Each gets one GPU
gbatch --gpus 1 --time 2:00:00 --config config1.yaml --name "exp1" python train.py
gbatch --gpus 1 --time 2:00:00 --config config2.yaml --name "exp2" python train.py
gbatch --gpus 1 --time 2:00:00 --config config3.yaml --name "exp3" python train.py
If you have 3 or more free GPUs, all three experiments run in parallel.
Dynamic GPU Scaling
Start with fewer GPUs, scale up later:
# Initial experiment (1 GPU)
gbatch --gpus 1 --time 1:00:00 python train.py --test-run
# Full training (4 GPUs) - submit after validation
gbatch --gpus 4 --time 8:00:00 python train.py --full
Hyperparameter Sweep with GPUs
# Grid search across 4 GPUs
for lr in 0.001 0.01 0.1; do
for batch_size in 32 64 128; do
gbatch --gpus 1 --time 3:00:00 \
--name "lr${lr}_bs${batch_size}" \
python train.py --lr $lr --batch-size $batch_size
done
done
# Monitor GPU allocation
watch -n 2 'gqueue -s Running,Queued -f JOBID,NAME,NODES,NODELIST'
Troubleshooting
Issue: Job not getting GPU
Possible causes:
Forgot to request GPU:
# Wrong - no GPU requested
gbatch python train.py
# Correct
gbatch --gpus 1 python train.py
All GPUs in use:
# Check allocation
gqueue -s Running -f NODES,NODELIST
ginfo
Job is queued:
# Job waits for GPU
$ gqueue -j <job_id> -f JOBID,ST,NODES,NODELIST
JOBID ST NODES NODELIST(REASON)
42 PD 1 (Resources)
Issue: Job sees wrong GPUs
Check CUDA_VISIBLE_DEVICES:
# In your job script
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
# Should match gqueue output
gqueue -f JOBID,NODELIST
Issue: Out of memory error
Solutions:
- Request more GPUs: --gpus 2
- Reduce batch size in your code
- Use gradient accumulation
- Enable mixed precision training (FP16)
Check memory:
nvidia-smi --query-gpu=memory.free,memory.used --format=csv
Issue: GPU utilization low
Possible causes:
- Data loading bottleneck (use more workers)
- CPU preprocessing bottleneck
- Small batch size
- Model too small for GPU
Debug:
# Monitor GPU utilization
watch -n 1 nvidia-smi
# Check job logs for bottlenecks
tail -f ~/.local/share/gflow/logs/<job_id>.log
Best Practices
- Request only needed GPUs: Don't over-allocate resources
- Monitor GPU usage: Use nvidia-smi to verify utilization
- Optimize data loading: Prevent GPU starvation
- Use mixed precision: Reduce memory usage with FP16
- Batch jobs efficiently: Group similar GPU requirements
- Release GPUs early: Use dependencies to chain CPU/GPU stages
- Test on 1 GPU first: Validate before scaling to multiple GPUs (see the sketch after this list)
- Set time limits: Prevent GPU hogging by runaway jobs
- Log GPU stats: Include GPU info in job logs
- Clean up checkpoints: Manage disk space when using GPUs
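A small sketch that combines "Test on 1 GPU first" with "Release GPUs early", reusing the --depends-on pattern shown earlier (exact dependency semantics are covered in Job Dependencies):

```bash
# Validate on a single GPU first
TEST_ID=$(gbatch --gpus 1 --time 1:00:00 python train.py --test-run | grep -oP '\d+')

# Queue the full 4-GPU run behind the test job
gbatch --depends-on $TEST_ID --gpus 4 --time 8:00:00 python train.py --full
```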
Performance Tips
Maximize GPU Utilization
# Increase batch size
train_loader = DataLoader(dataset, batch_size=128, num_workers=8)
# Use pin_memory for faster transfers
train_loader = DataLoader(dataset, batch_size=128, pin_memory=True)
# Enable AMP for mixed precision
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
Efficient Multi-GPU Usage
# Use DistributedDataParallel instead of DataParallel
from torch.nn.parallel import DistributedDataParallel as DDP
# More efficient communication
model = DDP(model, device_ids=[local_rank])
Monitor and Optimize
#!/bin/bash
# GFLOW --gpus 1
# Log GPU stats before training
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 10 > gpu_stats.log &
GPU_MONITOR_PID=$!
# Run training
python train.py
# Stop monitoring
kill $GPU_MONITOR_PID
Reference
Environment Variables
| Variable | Set By | Description |
|---|---|---|
| CUDA_VISIBLE_DEVICES | gflow | Comma-separated GPU IDs (e.g., "0,1") |
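Inside a job script the value can be split into individual IDs if you need to address the allocated GPUs one by one. A sketch (worker.py and its --rank flag are hypothetical):

```bash
#!/bin/bash
# GFLOW --gpus 2

# CUDA_VISIBLE_DEVICES looks like "0,1"; split it into an array
IFS=',' read -ra GPUS <<< "$CUDA_VISIBLE_DEVICES"
echo "Allocated ${#GPUS[@]} GPUs: ${GPUS[*]}"

# Pin one (hypothetical) worker process to each allocated GPU
for i in "${!GPUS[@]}"; do
    CUDA_VISIBLE_DEVICES="${GPUS[$i]}" python worker.py --rank "$i" &
done
wait
```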
GPU-Related Commands
# Check system GPUs
ginfo
# Submit job with GPUs
gbatch --gpus <N> ...
# Check GPU allocation
gqueue -f JOBID,NODES,NODELIST
# Monitor running GPU jobs
gqueue -s Running -f JOBID,NODES,NODELIST
# Monitor system GPUs
nvidia-smi
watch -n 1 nvidia-smi
See Also
- Job Submission - Complete job submission guide
- Job Dependencies - Workflow management
- Time Limits - Job timeout management
- Quick Reference - Command cheat sheet