# Job Dependencies
This guide covers how to create complex workflows using job dependencies in gflow.
## Overview
Job dependencies allow you to create workflows where jobs wait for other jobs to complete before starting. This is essential for:
- Multi-stage pipelines (preprocessing → training → evaluation)
- Sequential workflows with data dependencies
- Conditional execution based on previous results
- Resource optimization (release GPUs between stages)
## Basic Usage
### Simple Dependency
Submit a job that depends on another:
```bash
# Job 1: Preprocessing
$ gbatch --name "prep" python preprocess.py
Submitted batch job 1 (prep)

# Job 2: Training (waits for job 1)
$ gbatch --depends-on 1 --name "train" python train.py
Submitted batch job 2 (train)
```

**How it works:**
- Job 2 starts only after Job 1 completes successfully (state: `Finished`)
- If Job 1 fails, Job 2 is auto-cancelled by default (see Auto-Cancellation below)
- With `--no-auto-cancel`, Job 2 instead remains in the `Queued` state indefinitely, and you must cancel it manually with `gcancel`
**@ Syntax Sugar:** You can reference recent submissions without copying IDs:

- `--depends-on @`: the most recent submission (last job submitted)
- `--depends-on @~1`: the second most recent submission
- `--depends-on @~2`: the third most recent submission
- And so on...

This makes creating pipelines much simpler!
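For example (script names here are placeholders), keep in mind that the offsets count every submission, including ones you make while building the pipeline:

```bash
gbatch python first.py   # submission 1
gbatch python second.py  # submission 2

# Depends on second.py, the most recent submission
gbatch --depends-on @ python after_second.py

# first.py is now three submissions back:
# @ = after_second.py, @~1 = second.py, @~2 = first.py
gbatch --depends-on @~2 python after_first.py
```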
### Checking Dependencies
View dependency relationships:
```bash
$ gqueue -t
JOBID   NAME   ST  TIME      TIMELIMIT
1       prep   CD  00:02:15  UNLIMITED
└─ 2    train  R   00:05:30  04:00:00
   └─ 3 eval   PD  00:00:00  00:10:00
```

The tree view (`-t`) shows the dependency hierarchy with ASCII art.
## Creating Workflows
### Linear Pipeline
Execute jobs in sequence using @ syntax:
```bash
# Stage 1: Data collection
gbatch --time 10 python collect_data.py

# Stage 2: Data preprocessing (depends on stage 1)
gbatch --time 30 --depends-on @ python preprocess.py

# Stage 3: Training (depends on stage 2)
gbatch --time 4:00:00 --gpus 1 --depends-on @ python train.py

# Stage 4: Evaluation (depends on stage 3)
gbatch --time 10 --depends-on @ python evaluate.py
```

Watch the pipeline:

```bash
watch -n 5 gqueue -t
```

**How it works:** Each `--depends-on @` references the job submitted immediately before it, creating a clean sequential pipeline.
### Parallel Processing with Join
Multiple jobs feeding into one:
```bash
# Parallel data processing tasks
gbatch --time 30 python process_part1.py
gbatch --time 30 python process_part2.py
gbatch --time 30 python process_part3.py

# Merge results (waits for ALL three parallel tasks)
gbatch --depends-on-all @,@~1,@~2 python merge_results.py
```

Note that a plain `--depends-on @` would make merge_results.py wait only for the last parallel task (process_part3.py). For true multi-parent joins, use `--depends-on-all`, covered under Multi-Job Dependencies below.
### Branching Workflow
One job triggering multiple downstream jobs:
```bash
# Main processing
gbatch --time 1:00:00 python main_process.py

# Multiple analysis jobs (all depend on the main job)
gbatch --depends-on @ --time 30 python analysis_a.py
gbatch --depends-on @~1 --time 30 python analysis_b.py
gbatch --depends-on @~2 --time 30 python analysis_c.py
```

**Explanation:**

- The first analysis depends on `@` (the main_process job)
- The second analysis depends on `@~1` (skipping analysis_a, back to main_process)
- The third analysis depends on `@~2` (skipping analysis_a and analysis_b, back to main_process)
## Dependency States and Behavior
### When Dependencies Start
A job with dependencies transitions from `Queued` to `Running` when:

- The dependency job reaches the `Finished` state
- Required resources (GPUs, etc.) are available

Both conditions can be checked by hand, as shown below.
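A quick check of both conditions, using the same commands that appear in Troubleshooting below:

```bash
# Has the dependency finished? (ST should read CD)
gqueue -j <dep_id> -f JOBID,ST

# Are the required resources (e.g. GPUs) free?
ginfo
```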
### Failed Dependencies
If a dependency job fails:

- The dependent job is auto-cancelled by default (see Auto-Cancellation below)
- With `--no-auto-cancel`, it remains in the `Queued` state and will never start automatically
- In that case you must cancel it manually with `gcancel`
Example (with `--no-auto-cancel`):

```bash
# Job 1 fails
$ gqueue
JOBID  NAME   ST  TIME
1      prep   F   00:01:23
2      train  PD  00:00:00

# Job 2 will never run - cancel it manually
$ gcancel 2
```

### Timeout Dependencies
If a dependency job times out:

- Its state changes to `Timeout` (TO)
- It is treated the same as `Failed`
- Dependent jobs are auto-cancelled by default, or stay queued with `--no-auto-cancel`
### Cancelled Dependencies
If you cancel a job with dependencies:

- The job is cancelled
- Dependent jobs are auto-cancelled by default (with `--no-auto-cancel` they remain in the queue and won't start)
- Use `gcancel --dry-run` to see the impact before cancelling
Check cancellation impact:

```bash
$ gcancel --dry-run 1
Would cancel job 1 (prep)
Warning: The following jobs depend on job 1:
  - Job 2 (train)
  - Job 3 (eval)
These jobs will never start if job 1 is cancelled.
```

## Dependency Visualization
### Tree View
The tree view shows job dependencies clearly:
```bash
$ gqueue -t
JOBID   NAME           ST  TIME      TIMELIMIT
1       data-prep      CD  00:05:23  01:00:00
├─ 2    train-model-a  R   00:15:45  04:00:00
│  └─ 4 eval-a         PD  00:00:00  00:10:00
└─ 3    train-model-b  R   00:15:50  04:00:00
   └─ 5 eval-b         PD  00:00:00  00:10:00
```

**Legend:**

- `├─`: branch connection
- `└─`: last child connection
- `│`: continuation line
### Circular Dependency Detection
gflow detects and prevents circular dependencies:
```bash
$ gbatch --depends-on 2 python a.py
Submitted batch job 1

# This will fail
$ gbatch --depends-on 1 python b.py
Error: Circular dependency detected: Job 2 depends on Job 1, which depends on Job 2
```

**Protection:**
- Validation happens at submission time
- Prevents deadlocks in the job queue
- Ensures all dependencies can eventually resolve
## Advanced Patterns
### Checkpointed Pipeline
Resume from failure points:
```bash
#!/bin/bash
# pipeline.sh - Resume from checkpoints
set -e

if [ ! -f "data.pkl" ]; then
    echo "Stage 1: Preprocessing"
    python preprocess.py
fi

if [ ! -f "model.pth" ]; then
    echo "Stage 2: Training"
    python train.py
fi

echo "Stage 3: Evaluation"
python evaluate.py
```

Submit:

```bash
gbatch --gpus 1 --time 8:00:00 pipeline.sh
```

### Conditional Dependency Script
Create a script that submits jobs based on previous results:
```bash
#!/bin/bash
# conditional_submit.sh

# Poll until job 1 is no longer queued (PD) or running (R)
STATUS=$(gqueue -j 1 -f ST | tail -n 1)
while [ "$STATUS" = "PD" ] || [ "$STATUS" = "R" ]; do
    sleep 5
    STATUS=$(gqueue -j 1 -f ST | tail -n 1)
done

# Check if it succeeded
if [ "$STATUS" = "CD" ]; then
    echo "Job 1 succeeded, submitting next job"
    gbatch python next_step.py
else
    echo "Job 1 failed with status: $STATUS"
    exit 1
fi
```

### Array Jobs with Dependencies
Create job arrays that depend on a preprocessing job:
```bash
# Preprocessing - note the job ID printed at submission (say, 42)
gbatch --time 30 python preprocess.py

# Array of training jobs (all depend on the preprocessing job)
for i in {1..5}; do
    gbatch --depends-on 42 --gpus 1 --time 2:00:00 \
        python train.py --fold $i
done
```

**Note:** Use the preprocessing job's explicit ID here rather than `--depends-on @`. Inside the loop, `@` refers to the most recent submission, so from the second iteration onward it would point at the previous fold's job, chaining the folds sequentially instead of fanning them all out from preprocessing.
### Resource-Efficient Pipeline
Release GPUs between stages:
```bash
# Stage 1: CPU-only preprocessing
gbatch --time 30 python preprocess.py

# Stage 2: GPU training
gbatch --depends-on @ --gpus 2 --time 4:00:00 python train.py

# Stage 3: CPU-only evaluation
gbatch --depends-on @ --time 10 python evaluate.py
```

**Benefit:** GPUs are only allocated when needed, maximizing resource utilization.
## Monitoring Dependencies
### Check Dependency Status
```bash
# View specific jobs and their dependencies
gqueue -j 1,2,3 -f JOBID,NAME,ST,TIME

# View all jobs in tree format
gqueue -t

# Filter by state and view dependencies
gqueue -s Queued,Running -t
```

### Watch Pipeline Progress
```bash
# Real-time monitoring
watch -n 2 'gqueue -t'

# Show only active jobs
watch -n 2 'gqueue -s Running,Queued -t'
```

### Identify Blocked Jobs
Find jobs waiting on dependencies:
```bash
# Show queued jobs with dependency info
gqueue -s Queued -t

# Check why a job is queued
gqueue -j 5 -f JOBID,NAME,ST
gqueue -t | grep -A5 "^5"
```

## Dependency Validation
### Submission-time Validation
gbatch validates dependencies when you submit:
✅ **Valid submissions:**

- The dependency job exists
- No circular dependencies
- The dependency is not the job itself

❌ **Invalid submissions:**

- Dependency job doesn't exist: `Error: Dependency job 999 not found`
- Circular dependency: `Error: Circular dependency detected`
- Self-dependency: `Error: Job cannot depend on itself`
### Runtime Behavior
During execution:

- The scheduler checks dependencies every 5 seconds
- Jobs start when dependencies are `Finished` AND resources are available
- Failed or timed-out dependencies never trigger dependent jobs
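Because the scheduler polls on a 5-second interval, a dependent job may start a few seconds after its parent finishes. Building on that, here is a minimal sketch (a hypothetical helper, assuming `gqueue` prints a single header row as in the listings above) that blocks until a whole pipeline has drained:

```bash
#!/bin/bash
# wait_for_pipeline.sh - return once no jobs are queued or running
while [ "$(gqueue -s Queued,Running -f JOBID | tail -n +2 | wc -l)" -gt 0 ]; do
    sleep 5
done
echo "Pipeline drained."
```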
## Practical Examples
### Example 1: ML Training Pipeline
```bash
# Complete ML pipeline using @ syntax
gbatch --time 20 python prepare_dataset.py
gbatch --depends-on @ --gpus 1 --time 8:00:00 \
    python train.py --output model.pth
gbatch --depends-on @ --time 15 \
    python evaluate.py --model model.pth
gbatch --depends-on @ --time 5 python generate_report.py
```

### Example 2: Data Processing Pipeline
```bash
#!/bin/bash
# Submit a data processing pipeline

echo "Submitting data processing pipeline..."

# Download data
gbatch --time 1:00:00 --name "download" python download_data.py

# Validate data
gbatch --depends-on @ --time 30 --name "validate" python validate_data.py

# Transform data
gbatch --depends-on @ --time 45 --name "transform" python transform_data.py

# Upload results
gbatch --depends-on @ --time 30 --name "upload" python upload_results.py

echo "Pipeline submitted. Monitor with: watch gqueue -t"
```

### Example 3: Hyperparameter Sweep with Evaluation
```bash
# Train multiple models
for lr in 0.001 0.01 0.1; do
    gbatch --gpus 1 --time 2:00:00 \
        python train.py --lr $lr --output model_$lr.pth
done

# Wait for ALL three models, then evaluate
gbatch --depends-on-all @,@~1,@~2 --time 30 \
    python compare_models.py --models model_*.pth
```

**Note:** A plain `--depends-on @` would wait only for the last model trained. `--depends-on-all` (see Multi-Job Dependencies below) waits for every listed job; here `@`, `@~1`, and `@~2` cover the three sweep submissions.
## Troubleshooting
### Issue: Dependent job not starting
**Possible causes:**

1. The dependency job hasn't finished:

   ```bash
   gqueue -t
   ```

2. The dependency job failed:

   ```bash
   gqueue -j <dep_id> -f JOBID,ST
   ```

3. No resources available (GPUs):

   ```bash
   ginfo
   gqueue -s Running -f NODES,NODELIST
   ```
### Issue: Want to cancel a job with dependencies
**Solution:** Use dry-run first to see the impact:

```bash
# See what would happen
gcancel --dry-run <job_id>

# Cancel if acceptable
gcancel <job_id>

# Cancel dependent jobs too if needed
gcancel <job_id>
gcancel <dependent_job_id>
```

### Issue: Circular dependency error
**Solution:** Review your dependency chain:

```bash
# Check the job sequence
gqueue -j <job_ids> -t

# Restructure to eliminate cycles
```

### Issue: Lost track of dependencies
**Solution:** Use tree view:

```bash
# Show all job relationships
gqueue -a -t

# Focus on specific jobs
gqueue -j 1,2,3,4,5 -t
```

## Best Practices
- Plan workflows before submitting jobs
- Use meaningful names for jobs in pipelines (the `--name` flag)
- Use the @ syntax for simpler dependency chains
- Set appropriate time limits for each stage
- Monitor pipelines with `watch gqueue -t`
- Handle failures by checking dependency status
- Use dry-run before cancelling jobs with dependents
- Document pipelines in submission scripts
- Test small before submitting long pipelines
- Check logs when dependencies fail
## Multi-Job Dependencies
gflow supports advanced dependency patterns where a job can depend on multiple parent jobs with different logic modes.
### AND Logic (All Dependencies Must Succeed)
Use `--depends-on-all` when a job should wait for all parent jobs to finish successfully:

```bash
# Run three preprocessing jobs in parallel
gbatch --time 30 python preprocess_part1.py  # Job 101
gbatch --time 30 python preprocess_part2.py  # Job 102
gbatch --time 30 python preprocess_part3.py  # Job 103

# Training waits for ALL preprocessing jobs to complete
gbatch --depends-on-all 101,102,103 --gpus 2 --time 4:00:00 python train.py
```

With @ syntax:
```bash
gbatch python preprocess_part1.py  # Job 101
gbatch python preprocess_part2.py  # Job 102
gbatch python preprocess_part3.py  # Job 103

# Use @ syntax to reference the three most recent jobs
gbatch --depends-on-all @,@~1,@~2 --gpus 2 python train.py
```

### OR Logic (Any Dependency Must Succeed)
Use `--depends-on-any` when a job should start as soon as any one parent job finishes successfully:

```bash
# Try multiple data sources in parallel
gbatch --time 10 python fetch_from_source_a.py  # Job 201
gbatch --time 10 python fetch_from_source_b.py  # Job 202
gbatch --time 10 python fetch_from_source_c.py  # Job 203

# Process data from whichever source succeeds first
gbatch --depends-on-any 201,202,203 python process_data.py
```

**Use case:** Fallback scenarios where multiple approaches are tried in parallel and you want to proceed with the first successful result.
### Auto-Cancellation
By default, when a parent job fails, all dependent jobs are automatically cancelled:
```bash
gbatch python preprocess.py              # Job 301 - fails

# Job 302 will be auto-cancelled when 301 fails
gbatch --depends-on 301 python train.py  # Job 302
```

Disable auto-cancellation if you want dependent jobs to remain queued:

```bash
gbatch --depends-on 301 --no-auto-cancel python train.py
```

**When auto-cancellation happens:**

- The parent job fails (state: `Failed`)
- The parent job is cancelled (state: `Cancelled`)
- The parent job times out (state: `Timeout`)
**Cascade cancellation:** If job A depends on B, and B depends on C, then when C fails, both B and A are cancelled automatically.
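A sketch of that cascade (job IDs and script names here are illustrative):

```bash
gbatch python stage_c.py                   # Job 401 (C)
gbatch --depends-on 401 python stage_b.py  # Job 402 (B depends on C)
gbatch --depends-on 402 python stage_a.py  # Job 403 (A depends on B)

# If job 401 fails, jobs 402 and 403 are auto-cancelled in turn
```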
### Circular Dependency Detection
gflow automatically detects and prevents circular dependencies at submission time:
```bash
gbatch python job_a.py                 # Job 1
gbatch --depends-on 1 python job_b.py  # Job 2

# This will be rejected with an error
gbatch --depends-on-all 1,2 python job_c.py  # Would create a cycle
```

Error message:

```
Circular dependency detected: Job 3 depends on Job 2, which has a path back to Job 3
```

### Complex Workflow Example
Combining AND and OR logic for sophisticated workflows:
```bash
# Stage 1: Try multiple data collection methods
gbatch --time 30 python collect_method_a.py  # Job 1
gbatch --time 30 python collect_method_b.py  # Job 2

# Stage 2: Process whichever collection succeeds first
gbatch --depends-on-any 1,2 --time 1:00:00 python process_data.py  # Job 3

# Stage 3: Run multiple preprocessing tasks in parallel
gbatch --depends-on 3 --time 30 python preprocess_features.py  # Job 4
gbatch --depends-on 3 --time 30 python preprocess_labels.py    # Job 5

# Stage 4: Training waits for both preprocessing tasks
gbatch --depends-on-all 4,5 --gpus 2 --time 8:00:00 python train.py  # Job 6

# Stage 5: Evaluation depends on training
gbatch --depends-on 6 --time 30 python evaluate.py  # Job 7
```

Visualize with tree view:
```bash
$ gqueue -t
JOBID   NAME                 ST  TIME      TIMELIMIT
1       collect_method_a     CD  00:15:30  00:30:00
2       collect_method_b     F   00:10:00  00:30:00
└─ 3    process_data         CD  00:45:00  01:00:00
   ├─ 4 preprocess_features  CD  00:20:00  00:30:00
   └─ 5 preprocess_labels    CD  00:18:00  00:30:00
      └─ 6 train             R   02:30:00  08:00:00
         └─ 7 evaluate       PD  00:00:00  00:30:00
```

## Limitations
**Remaining limitations:**

- No dependency on specific job states (e.g., "start when job X fails")
- No job groups or batch dependencies beyond `max_concurrent`

**Removed limitations (now supported):**

- "Only one dependency per job": gflow now supports multiple dependencies with `--depends-on-all` and `--depends-on-any`
- "No automatic cancellation": gflow now auto-cancels by default (can be disabled with `--no-auto-cancel`)
**Workarounds:**

- For state-based dependencies, use conditional scripts that check job status (see the sketch below)
- Use external workflow managers for very complex DAGs if needed
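As an example of the first workaround, a minimal sketch that submits a cleanup step only if a watched job fails or times out. The script name and `cleanup.py` are placeholders; the state codes (`PD`, `R`, `F`, `TO`) are the ones used throughout this guide:

```bash
#!/bin/bash
# on_failure.sh <job_id> - submit cleanup only when the watched job fails
JOB=$1

# Poll until the job leaves the queued/running states
STATUS=$(gqueue -j "$JOB" -f ST | tail -n 1)
while [ "$STATUS" = "PD" ] || [ "$STATUS" = "R" ]; do
    sleep 5
    STATUS=$(gqueue -j "$JOB" -f ST | tail -n 1)
done

# Act only on failure or timeout
if [ "$STATUS" = "F" ] || [ "$STATUS" = "TO" ]; then
    gbatch --name "cleanup" python cleanup.py
fi
```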
## See Also
- Job Submission - Complete job submission guide
- Time Limits - Managing job timeouts
- Quick Reference - Command cheat sheet
- Quick Start - Basic usage examples