Design Nomad Jobs for Resiliency

Restart a Workload Based on Health Checks

The check_restart stanza instructs Nomad when to restart tasks with unhealthy service checks. When a health check in Consul has been unhealthy for the limit specified in a check_restart stanza, it is restarted according to the task group's restart policy.

The limit field is used to specify the number of times a failing health check is seen before local restarts are attempted. Operators can also specify a grace duration to wait after a task restarts before checking its health.

You should configure the check restart on services when its likely that a restart would resolve the failure. An example of this might be restarting to correct a transient connection issue on the service.

The following check_restart stanza waits for two consecutive health check failures with a grace period and considers both critical and warning statuses as failures.

check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false

The following CLI example output shows health check failures triggering restarts until its restart limit is reached.

$ nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                   = e1b43128
Eval ID              = 249cbfe9
Name                 = demo.demo[0]
Node ID              = 221e998e
Job ID               = demo
Job Version          = 0
Client Status        = failed
Client Description   = <none>
Desired Status       = run
Desired Description  = <none>
Created              = 2m59s ago
Modified             = 39s ago

Task "test" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1:

Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00

Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client