Tasks can sometimes fail due to network, CPU or memory issues on the node
running the task. In such situations, Nomad can reschedule the task on another
node. The reschedule stanza can be used to configure how Nomad
should try placing failed tasks on another node in the cluster. Reschedule
attempts have a delay between each attempt, and the delay can be configured to
increase between each rescheduling attempt according to a configurable
delay_function. Consult the
reschedule stanza documentation for more information.
Service jobs are configured by default to have unlimited reschedule attempts. You should use the reschedule stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention.
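As a rough illustration of the settings described above, a reschedule stanza might look like the sketch below. The values are assumptions chosen for illustration, not recommendations: the limit of 5 attempts only mirrors the "3/5" shown in the allocation status output on this page, and the interval and delay values are hypothetical. The stanza is placed at the job or group level.

```hcl
# Illustrative sketch only; all values here are assumptions.
reschedule {
  # Allow at most 5 reschedule attempts within a 1 hour window.
  attempts  = 5
  interval  = "1h"
  unlimited = false

  # Wait 30s before the first attempt, then increase the delay
  # exponentially between attempts, up to a ceiling of 10 minutes.
  delay          = "30s"
  delay_function = "exponential"
  max_delay      = "10m"
}
```

With unlimited set to false, the CLI can report attempts against the limit (for example 3/5). Setting unlimited to true instead keeps the default service job behavior of retrying indefinitely, in which case attempts and interval do not apply.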
The following CLI example shows job and allocation statuses for a task being rescheduled by Nomad. The CLI shows the number of previous attempts if there is a limit on the number of reschedule attempts. The CLI also shows when the next reschedule will be attempted.
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago

$ nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:27646

Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client