Job Lifecycle

This guide explains the complete lifecycle of jobs in gflow, including state transitions, status checking, and recovery operations.

Job States

gflow jobs can be in one of seven states:

State      Short  Description
Queued     PD     Job is waiting to run (pending dependencies or resources)
Hold       H      Job is on hold by user request
Running    R      Job is currently executing
Finished   CD     Job completed successfully
Failed     F      Job terminated with an error
Cancelled  CA     Job was cancelled by user or system
Timeout    TO     Job exceeded its time limit

State Categories

Active States (job is not yet complete):

  • Queued, Hold, Running

Completed States (job has finished):

  • Finished, Failed, Cancelled, Timeout
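
The states and their two categories can be modelled as a small enum. This is only an illustrative sketch built from the table above, not gflow's internal representation; the names JobState, ACTIVE_STATES, COMPLETED_STATES, is_completed, and from_short_code are assumptions introduced here.

```python
from enum import Enum


class JobState(Enum):
    """The seven job states, valued by the short code from the table."""
    QUEUED = "PD"
    HOLD = "H"
    RUNNING = "R"
    FINISHED = "CD"
    FAILED = "F"
    CANCELLED = "CA"
    TIMEOUT = "TO"


# Active states: the job is not yet complete.
ACTIVE_STATES = frozenset({JobState.QUEUED, JobState.HOLD, JobState.RUNNING})

# Completed states: everything that is not active.
COMPLETED_STATES = frozenset(JobState) - ACTIVE_STATES


def is_completed(state: JobState) -> bool:
    """Return True if the state is one of the completed states."""
    return state in COMPLETED_STATES


def from_short_code(code: str) -> JobState:
    """Look up a state by its short code, e.g. 'PD'; raises ValueError if unknown."""
    return JobState(code)
```

For example, from_short_code("TO") yields JobState.TIMEOUT, and is_completed returns True for it.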

State Transition Diagram

The following diagram keeps only the core transitions. Completed states are terminal.

State Transition Rules

From Queued:

  • Running: When dependencies are met AND resources are available
  • Hold: User runs gjob hold <job_id>
  • Cancelled: User runs gcancel <job_id> OR a dependency fails (with auto-cancel enabled)

From Hold:

  • Queued: User runs gjob release <job_id>
  • Cancelled: User runs gcancel <job_id>

From Running:

  • Finished: Job script/command exits with code 0
  • Failed: Job script/command exits with non-zero code
  • Cancelled: User runs gcancel <job_id>
  • Timeout: Job exceeds its time limit (set with --time)

From Completed States:

  • No transitions (final states)
  • Use gjob redo <job_id> to create a new job with the same parameters
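
The rules above amount to a small transition table. The sketch below encodes them as a dictionary so a proposed transition can be checked; it is a reading of the listed rules, not gflow's actual implementation, and TRANSITIONS, can_transition, and state_after_exit are hypothetical names.

```python
# Allowed target states for each source state, taken from the rules above.
TRANSITIONS = {
    "Queued": {"Running", "Hold", "Cancelled"},
    "Hold": {"Queued", "Cancelled"},
    "Running": {"Finished", "Failed", "Cancelled", "Timeout"},
    # Completed states are final: no outgoing transitions.
    "Finished": set(),
    "Failed": set(),
    "Cancelled": set(),
    "Timeout": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the rules allow moving from current to target."""
    return target in TRANSITIONS[current]


def state_after_exit(exit_code: int) -> str:
    """Map the exit code of a running job to its completed state."""
    return "Finished" if exit_code == 0 else "Failed"
```

Here can_transition("Hold", "Running") is False: a held job has to return to Queued before it can run.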

Automatic Retries

  • Set a per-job retry budget with gbatch --max-retries <N> or gjob update <job_id> --max-retries <N>.
  • When a running job exits non-zero, gflow can submit a new queued attempt until that budget is exhausted.
  • Queued dependents are retargeted to the newest retry attempt automatically.
  • Timeouts and explicit fail requests are currently terminal and are not retried.
  • Manual gjob redo stays separate from automatic retry tracking.
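
The retry behaviour can be sketched as a small decision function. This is an assumption-laden model, not gflow's scheduler code: the Attempt record, its retries_used counter, and the helper names are hypothetical, and it assumes the budget counts retries (so --max-retries 1 allows one extra attempt).

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class Attempt:
    job_id: int
    max_retries: int       # budget, as set with --max-retries
    retries_used: int = 0  # hypothetical counter of retries already spent


def should_retry(attempt: Attempt, outcome: str, explicit_fail: bool = False) -> bool:
    """Retry only a non-zero exit; timeouts and explicit fail requests stay terminal."""
    if outcome != "Failed" or explicit_fail:
        return False
    return attempt.retries_used < attempt.max_retries


def next_attempt(attempt: Attempt, new_job_id: int) -> Attempt:
    """Describe the new queued attempt, with one more retry spent."""
    return replace(attempt, job_id=new_job_id, retries_used=attempt.retries_used + 1)


def retarget_dependents(depends_on: Dict[int, int], old_id: int, new_id: int) -> Dict[int, int]:
    """Point queued dependents (child -> parent) at the newest retry attempt."""
    return {child: (new_id if parent == old_id else parent)
            for child, parent in depends_on.items()}
```

With a budget of 1, the first failure produces one new attempt, and a second failure of that attempt is final.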

Job State Reasons

Jobs in certain states have an associated reason that provides more context:

State      Reason                         Description
Queued     WaitingForDependency           Job is waiting for parent jobs to finish
Queued     WaitingForGpu (Resources)      Job is waiting for available GPUs
Queued     WaitingForMemory (Resources)   Job is waiting for available host memory
Queued     WaitingForResources            Job is waiting for other scheduler-managed resources/limits
Hold       JobHeldUser                    Job was put on hold by user request
Cancelled  CancelledByUser                User explicitly cancelled the job
Cancelled  DependencyFailed:<job_id>      Job was auto-cancelled because job <job_id> failed
Cancelled  SystemError:<msg>              Job was cancelled due to a system error

View the reason with gjob show <job_id> or gqueue -f JOBID,ST,REASON.
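
When post-processing reasons in a script, the DependencyFailed:<job_id> and SystemError:<msg> forms can be split into a kind and a detail. The sketch below assumes the reason text appears exactly as written in the table; parse_reason is a hypothetical helper and not part of gflow.

```python
from typing import Optional, Tuple


def parse_reason(reason: str) -> Tuple[str, Optional[str]]:
    """Split 'Kind:detail' into (kind, detail); detail is None when there is no colon."""
    # Split on the first colon only, so a message that itself contains colons stays intact.
    kind, sep, detail = reason.partition(":")
    return kind, (detail if sep else None)
```

For instance, parse_reason("DependencyFailed:42") gives ("DependencyFailed", "42"), while parse_reason("CancelledByUser") gives ("CancelledByUser", None).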

Status Checking Workflow

The following diagram shows a simplified check -> action -> recheck loop:
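
In script form, a check -> action -> recheck loop might look like the sketch below. It is a minimal illustration under stated assumptions: get_state and on_hold are hypothetical callables supplied by the caller (for example, wrappers around gjob show and gjob release), and the polling interval and check limit are arbitrary.

```python
import time
from typing import Callable, Optional

COMPLETED = {"Finished", "Failed", "Cancelled", "Timeout"}


def wait_for_completion(
    get_state: Callable[[], str],
    on_hold: Optional[Callable[[], None]] = None,
    interval: float = 5.0,
    max_checks: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job's state until it reaches a completed state, then return that state."""
    for _ in range(max_checks):
        state = get_state()                      # check
        if state in COMPLETED:
            return state
        if state == "Hold" and on_hold is not None:
            on_hold()                            # action, e.g. release the job
        sleep(interval)                          # wait, then recheck
    raise TimeoutError("job did not complete within the allotted checks")
```

The sleep parameter is injectable so the loop can be exercised without real delays.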

Released under the MIT License.