Design Nomad Jobs for Resiliency

Define Reschedule Behaviors for a Job

Tasks can sometimes fail due to network, CPU, or memory issues on the node running the task. In such situations, Nomad can reschedule the task on another node. The reschedule stanza configures how Nomad tries to place failed tasks on another node in the cluster. Successive reschedule attempts are separated by a delay, and that delay can grow from one attempt to the next according to a configurable delay_function. Consult the reschedule stanza documentation for more information.
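As an illustration, a reschedule stanza with a bounded number of attempts might look like the following sketch. The job and group names follow the demo job used in the CLI output later in this section. The specific values (5 attempts per hour, a 5-second base delay, a 2-minute cap) are assumptions chosen for the example, not the configuration that produced that output. Task definitions are omitted, so this is a fragment rather than a complete job file.

```hcl
job "demo" {
  datacenters = ["dc1"]
  type        = "service"

  group "demo" {
    reschedule {
      # Allow at most 5 reschedule attempts within any 1-hour window
      # (illustrative values).
      attempts  = 5
      interval  = "1h"
      unlimited = false

      # Wait 5s before the first attempt. With the "exponential"
      # function the delay doubles on each subsequent attempt
      # (5s, 10s, 20s, ...) until it reaches max_delay.
      delay          = "5s"
      delay_function = "exponential"
      max_delay      = "2m"
    }

    # task "demo" { ... } omitted for brevity
  }
}
```

The other accepted delay_function values are "constant", which keeps the delay fixed, and "fibonacci", where each delay is the sum of the previous two.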

Service jobs are configured by default to have unlimited reschedule attempts. Use the reschedule stanza to tune this behavior so that failed tasks are automatically retried on another node without operator intervention.
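For comparison, the default behavior for service jobs can be written out explicitly. The values in this sketch are believed to match the defaults documented for service jobs, but they should be checked against the reschedule stanza documentation for the Nomad version in use.

```hcl
reschedule {
  # Keep rescheduling until the allocation is placed successfully.
  # When unlimited is true, attempts and interval are not used.
  unlimited = true

  # Start at 30s and double the delay after each attempt, never
  # waiting longer than 1h between attempts.
  delay          = "30s"
  delay_function = "exponential"
  max_delay      = "1h"
}
```

Writing the stanza out this way makes the policy visible in the job file and gives a starting point for adjusting the delay or switching to a bounded number of attempts.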

The following CLI example shows the job and allocation statuses for a task that Nomad is rescheduling. When the number of reschedule attempts is limited, the allocation status shows how many attempts have been made so far (Reschedule Attempts). The output also shows when the next reschedule will be attempted.

$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now

ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago

$ nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo[0]
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1:

Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client