# Job Submission
This guide covers all aspects of submitting jobs with gbatch, from basic usage to advanced features.
## Overview
gbatch is gflow's job submission tool, similar to Slurm's sbatch. It supports both direct command execution and script-based job submission.
## Basic Usage

### Submitting a Command
The simplest way to submit a job is to provide the command directly:
```bash
gbatch python script.py
```

Output:

```
Submitted batch job 1 (silent-pump-6338)
```

No quotes needed for simple commands! Arguments are automatically joined:

```bash
gbatch python train.py --epochs 100 --lr 0.01
```

### Command Argument Safety
gflow automatically handles special characters in command arguments using shell escaping:
```bash
# Arguments with spaces
gbatch python script.py --message "Hello World"

# Arguments with special characters
gbatch python script.py --pattern 'test_*.py'

# Complex arguments
gbatch bash -c 'echo $USER && python script.py'
```

**How it works:**
- Command arguments are properly escaped before execution
- Prevents shell injection and unintended command interpretation
- Special characters like spaces, quotes, and wildcards are handled safely
- Uses the `shell-escape` library to ensure safety
**Best practice:** While gflow handles escaping automatically, it's still recommended to:
- Test complex commands locally first
- Use explicit quoting for clarity
- Avoid overly complex inline commands (use script files instead, as sketched below)
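For example, the inline `bash -c` command above can be moved into a small script (the file name `run_all.sh` is illustrative):

```bash
#!/bin/bash
# run_all.sh (illustrative): same steps as the inline example, but easier to read and test
set -e
echo "$USER"
python script.py
```

Then submit it like any other script: `gbatch run_all.sh`.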
### Submitting a Script
Create a script file and submit it:
```bash
# Create script
cat > my_job.sh << 'EOF'
#!/bin/bash
echo "Hello from gflow!"
python train.py
EOF

# Make executable
chmod +x my_job.sh

# Submit
gbatch my_job.sh
```

## Resource Allocation
### GPU Requests
Request GPUs for your job:
```bash
# Request 1 GPU
gbatch --gpus 1 python train.py

# Request 2 GPUs
gbatch --gpus 2 python multi_gpu_train.py
```

The scheduler automatically sets `CUDA_VISIBLE_DEVICES` to the allocated GPUs.
Check GPU allocation:
```
$ gqueue -f JOBID,NAME,NODES,NODELIST
JOBID  NAME              NODES  NODELIST(REASON)
42     silent-pump-6338  1      0
43     brave-river-1234  2      1,2
```

### Conda Environment
Activate a conda environment before running your job:
```bash
gbatch --conda-env myenv python script.py
```

This is equivalent to running:

```bash
conda activate myenv
python script.py
```

## Job Scheduling Options
### Priority
Control when your job runs relative to others:
```bash
# High priority (runs first)
gbatch --priority 100 python urgent.py

# Default priority
gbatch python normal.py  # priority = 10

# Low priority (runs last)
gbatch --priority 1 python background.py
```

**Priority details:**
- Range: 0-255
- Default: 10
- Higher values = higher priority
- Jobs are scheduled based on a multi-factor priority system (see below)
**Scheduling Priority Hierarchy:**
When resources become available, gflow schedules jobs using a three-level priority system:
1. **User Priority (primary):** Jobs with higher `--priority` values run first
2. **Time Limit Bonus (secondary):** Among jobs with equal priority:
   - Time-limited jobs are preferred over unlimited jobs
   - Shorter jobs run before longer jobs
3. **Submission Order (tertiary):** Jobs submitted earlier run first (FIFO)
**Examples:**
```bash
# These jobs will run in the following order when GPUs become available:

# 1st: High priority, even though unlimited
gbatch --priority 20 python urgent.py

# 2nd: Same priority, but 10-minute limit beats unlimited
gbatch --priority 10 --time 10 python quick.py

# 3rd: Same priority, but 1-hour limit (submitted first)
gbatch --priority 10 --time 1:00:00 python train1.py  # Job ID 100

# 4th: Same priority and limit, but submitted later
gbatch --priority 10 --time 1:00:00 python train2.py  # Job ID 101

# 5th: Same priority, unlimited (submitted first)
gbatch --priority 10 python long1.py  # Job ID 102

# 6th: Same priority, unlimited (submitted later)
gbatch --priority 10 python long2.py  # Job ID 103
```

**Key insights:**
- Setting `--time` not only prevents runaway jobs but also improves scheduling priority
- Shorter time limits get slight preference, encouraging accurate estimates
- Submission order acts as a fair tie-breaker when all else is equal
### Time Limits
Set maximum runtime for jobs:
```bash
# 30 minutes
gbatch --time 30 python quick.py

# 2 hours
gbatch --time 2:00:00 python train.py

# 5 minutes 30 seconds
gbatch --time 5:30 python test.py
```

See Time Limits for comprehensive documentation.
### Job Names
By default, jobs get auto-generated names (e.g., "silent-pump-6338"). You can specify custom names:
gbatch --name "my-training-run" python train.pyNote: The --name option is for custom naming. If not specified, a random name is generated.
### Job Dependencies
Make jobs wait for other jobs to complete:
```bash
# Job 1: Preprocessing
gbatch --name "prep" python preprocess.py
# Returns: Submitted batch job 1

# Job 2: Training (waits for job 1)
gbatch --depends-on 1 --name "train" python train.py

# Job 3: Evaluation (waits for job 2)
gbatch --depends-on 2 --name "eval" python evaluate.py
```

See Job Dependencies for advanced dependency management.
### Job Arrays
Run multiple similar tasks in parallel:
```bash
# Create 10 jobs with task IDs 1-10
gbatch --array 1-10 python process.py --task '$GFLOW_ARRAY_TASK_ID'
```

**How it works:**
- Creates 10 separate jobs
- Each job has `$GFLOW_ARRAY_TASK_ID` set to its task number
- All jobs share the same resource requirements
- Useful for parameter sweeps, data processing, etc.
**Example with different parameters:**

```bash
gbatch --array 1-5 --gpus 1 --time 2:00:00 \
    python train.py --lr '$(echo "0.001 0.01 0.1 0.5 1.0" | cut -d" " -f$GFLOW_ARRAY_TASK_ID)'
```

**Environment variable:**

- `GFLOW_ARRAY_TASK_ID`: Task ID for array jobs (1, 2, 3, ...)
- Set to 0 for non-array jobs; one usage pattern is sketched below
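A common pattern, sketched here with a hypothetical `params.txt` holding one parameter set per line, is to let each array task read the line matching its ID:

```bash
#!/bin/bash
# GFLOW --gpus 1
# params.txt is hypothetical: line N holds the arguments for task N,
# e.g. "--lr 0.01 --batch-size 64"
ARGS=$(sed -n "${GFLOW_ARRAY_TASK_ID}p" params.txt)
python train.py $ARGS   # unquoted on purpose so the arguments split into words
```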
## Script Directives
Instead of command-line options, you can embed job requirements in your script using `# GFLOW` directives:
```bash
#!/bin/bash
# GFLOW --gpus 1
# GFLOW --time 2:00:00
# GFLOW --priority 20
# GFLOW --conda-env myenv

echo "Starting training..."
python train.py --epochs 100
echo "Training complete!"
```

Submit the script:
```bash
gbatch my_script.sh
```

**Directive precedence:**
- Command-line arguments override script directives
- Example: `gbatch --time 1:00:00 my_script.sh` overrides the `--time` directive in the script
**Supported directives:**

- `# GFLOW --gpus <N>`
- `# GFLOW --time <TIME>`
- `# GFLOW --priority <N>`
- `# GFLOW --conda-env <ENV>`
- `# GFLOW --depends-on <ID>`
## Creating Script Templates
Use `gbatch new` to create a job script template:

```
$ gbatch new my_job
```

This creates `my_job.sh` with a template:
```bash
#!/bin/bash
# GFLOW --gpus 0
# GFLOW --time 1:00:00
# GFLOW --priority 10

# Your commands here
echo "Job started at $(date)"

# Add your actual commands
# python script.py

echo "Job finished at $(date)"
```

Edit the template and submit:
```bash
# Edit the script
vim my_job.sh

# Make executable
chmod +x my_job.sh

# Submit
gbatch my_job.sh
```

### Automatic Template Generation
The job script template is automatically generated from the gbatch CLI definition to ensure it always reflects available options:
- **Template Source:** The template is generated from `src/bin/gbatch/cli.rs`
- **Automatic Sync:** A pre-commit hook automatically regenerates the template when command-line options change
- **Always Current:** You always get the latest available options in your templates
**For developers:** See `scripts/README.md` for details on how the template generation works.
## Environment Variables
gflow automatically sets these environment variables in your job:
| Variable | Description | Example |
|---|---|---|
| `CUDA_VISIBLE_DEVICES` | GPU IDs allocated to the job | `0,1` |
| `GFLOW_ARRAY_TASK_ID` | Task ID for array jobs (0 for non-array) | `5` |
Example usage:
```bash
#!/bin/bash
echo "Using GPUs: $CUDA_VISIBLE_DEVICES"
echo "Array task ID: $GFLOW_ARRAY_TASK_ID"
python train.py
```

## Output and Logging
Job output is automatically captured to log files:
**Log location:** `~/.local/share/gflow/logs/<job_id>.log`
View logs:
```bash
# View completed job log
cat ~/.local/share/gflow/logs/42.log

# Follow running job log
tail -f ~/.local/share/gflow/logs/42.log
```

Attach to running job (via tmux):
```bash
# Get job session name
gqueue -f JOBID,NAME

# Attach to session
tmux attach -t <session_name>

# Detach without stopping (Ctrl-B, then D)
```

## Advanced Examples
### Parameter Sweep
Test multiple hyperparameters:
```bash
# Submit multiple training runs
for lr in 0.001 0.01 0.1; do
    gbatch --gpus 1 --time 4:00:00 \
        --name "train-lr-$lr" \
        python train.py --lr $lr
done
```

### Pipeline with Dependencies
```bash
# Step 1: Data preprocessing
gbatch --time 30 python preprocess.py

# Step 2: Training
gbatch --time 4:00:00 --gpus 1 --depends-on @ python train.py

# Step 3: Evaluation
gbatch --time 10 --depends-on @ python evaluate.py
```

The `@` symbol references the most recently submitted job, making pipelines simple and clean.
### Multi-stage Job Script
```bash
#!/bin/bash
# GFLOW --gpus 1
# GFLOW --time 8:00:00

set -e  # Exit on error

echo "Stage 1: Data preparation"
python prepare_data.py

echo "Stage 2: Model training"
python train.py --checkpoint model.pth

echo "Stage 3: Evaluation"
python evaluate.py --model model.pth

echo "All stages complete!"
```

### Conditional Job Submission
```bash
#!/bin/bash
# Submit job only if previous job succeeded
PREV_JOB=42
STATUS=$(gqueue -j $PREV_JOB -f ST | tail -n 1)

if [ "$STATUS" = "CD" ]; then
    gbatch python next_step.py
else
    echo "Previous job not completed successfully"
fi
```

## Common Patterns
### Long-running with Checkpointing
```python
# train.py with checkpoint support
import signal
import sys

import torch

def save_checkpoint():
    print("Saving checkpoint...")
    # Save model state
    torch.save(model.state_dict(), 'checkpoint.pth')

def signal_handler(sig, frame):
    save_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)

# Training loop
for epoch in range(epochs):
    train_epoch()
    if epoch % 10 == 0:
        save_checkpoint()
```

Submit with time limit:
```bash
gbatch --time 8:00:00 --gpus 1 python train.py
```
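If the job hits its time limit, it can be resubmitted to continue from the last checkpoint. A minimal sketch, assuming `train.py` accepts a hypothetical `--resume` flag:

```bash
# Resubmit, resuming from the checkpoint if one exists
if [ -f checkpoint.pth ]; then
    gbatch --time 8:00:00 --gpus 1 python train.py --resume checkpoint.pth
else
    gbatch --time 8:00:00 --gpus 1 python train.py
fi
```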
### GPU Utilization Check

```bash
#!/bin/bash
# GFLOW --gpus 1

echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi --query-gpu=index,name,memory.total --format=csv
python train.py
```

## Validation and Error Handling
gbatch validates your submission before accepting it.

**Common validation errors:**
**Invalid dependency:** Job ID doesn't exist

```
Error: Dependency job 999 not found
```

**Circular dependency:** Job depends on itself or creates a cycle

```
Error: Circular dependency detected
```

**Invalid time format:** Malformed time specification

```
Error: Invalid time format. Use HH:MM:SS, MM:SS, or MM
```

**Script not found:** File doesn't exist

```
Error: Script file not found: missing.sh
```
## Tips and Best Practices
- Always set time limits for production jobs to prevent runaway processes
- Use meaningful names for easier job tracking
- Test scripts locally before submitting
- Add error handling (`set -e`) in bash scripts
- Implement checkpointing for long-running jobs
- Use job arrays for parallel independent tasks
- Check dependencies before submitting dependent jobs
- Monitor GPU usage when requesting multiple GPUs
- Use conda environments for reproducibility
- Add logging to your scripts for easier debugging (see the sketch after this list)
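A minimal job script combining several of these tips, with strict error handling plus timestamped log lines:

```bash
#!/bin/bash
# GFLOW --time 1:00:00
set -euo pipefail               # stop on errors, unset variables, and failed pipes
echo "[$(date)] preprocessing"  # timestamps make the job log easier to read
python preprocess.py
echo "[$(date)] training"
python train.py
echo "[$(date)] done"
```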
## Troubleshooting
**Issue:** Job submission fails with "dependency not found"

**Solution:** Verify the dependency job exists:

```bash
gqueue -j <dependency_id>
```

**Issue:** Job doesn't get a GPU
**Check:**

- Did you request a GPU? (`--gpus 1`)
- Are GPUs available? (`ginfo`)
- Are other jobs using all GPUs? (`gqueue -s Running -f NODES,NODELIST`)
**Issue:** Conda environment not activating

**Check:**

- Environment name is correct: `conda env list`
- Conda is initialized in your shell
- Job logs show no activation errors (see the sketch below)
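For example, to confirm the environment exists and then scan the job log for activation errors (job ID 42 is a placeholder):

```bash
conda env list                                  # is the environment listed?
grep -i conda ~/.local/share/gflow/logs/42.log  # any activation errors in the job log?
```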
**Issue:** Script not executable

**Solution:**

```bash
chmod +x my_script.sh
gbatch my_script.sh
```

## Reference
**Full command syntax:**

```
gbatch [OPTIONS] <SCRIPT>
gbatch [OPTIONS] <COMMAND> [ARGS...]
```

**All options:**

- `--gpus <N>` or `-g <N>`: Number of GPUs
- `--time <TIME>` or `-t <TIME>`: Time limit
- `--priority <N>`: Job priority (0-255, default: 10)
- `--depends-on <ID>`: Job dependency
- `--conda-env <ENV>` or `-c <ENV>`: Conda environment
- `--array <SPEC>`: Job array (e.g., "1-10")
- `--name <NAME>`: Custom job name
- `--config <PATH>`: Custom config file (hidden)
**Get help:**

```
$ gbatch --help
<!-- cmdrun gbatch --help -->
```

## See Also
- Time Limits - Detailed time limit documentation
- Job Dependencies - Advanced dependency workflows
- GPU Management - GPU allocation and monitoring
- Quick Reference - Command cheat sheet