Restart a workload based on health checks

2min
|
Nomad

The check_restart stanza instructs Nomad when to restart tasks with unhealthy service checks. When a health check in Consul has been unhealthy for the limit specified in a check_restart stanza, it is restarted according to the task group's restart policy. Restarts are local to the node running the task based on the tasks restart policy.

The limit field is used to specify the number of times a failing health check is seen before local restarts are attempted. Operators can also specify a grace duration to wait after a task restarts before checking its health.

You should configure the check restart on services when its likely that a restart would resolve the failure. An example of this might be restarting to correct a transient connection issue on the service.

The following check_restart stanza waits for two consecutive health check failures with a grace period and considers both critical and warning statuses as failures.

check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false
}

The following CLI example output shows health check failures triggering restarts until its restart limit is reached.

$ nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                   = e1b43128
Eval ID              = 249cbfe9
Name                 = demo.demo[0]
Node ID              = 221e998e
Job ID               = demo
Job Version          = 0
Client Status        = failed
Client Description   = <none>
Desired Status       = run
Desired Description  = <none>
Created              = 2m59s ago
Modified             = 39s ago

Task "test" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:28422

Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00

Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client

Configure restart behaviors

Reschedule behaviors