Service Discovery and Consul DNS

Use Health Checks With Consul

One important feature of the Consul agent is to manage system-level and application-level health checks. A health check is considered to be application-level if it is associated with a service. If not associated with a service, the check monitors the health of the entire node.

Checks are useful for monitoring the state of your services inside your datacenter and can be applied to many different use cases. Ultimately, Consul leverages checks to maintain accurate DNS query results by omitting services and nodes that are marked unhealthy.

In this guide, you'll learn how to select the best health check to configure depending on the entity you want to monitor, how to write a health check definition and how to register a service and a node health check manually. Finally, you'll learn how to monitor the state of the services using the resources natively provided by Consul: UI, HTTP API, and CLI.

Prerequisites

To complete this guide, you can use a local dev agent or an existing Consul deployment. Be sure to enable the UI; if you're using a local dev agent, add the -ui flag.

$ mkdir consul-data
$ mkdir consul-conf
$ consul agent -dev -ui -data-dir=./consul-data -config-dir=./consul-conf -enable-local-script-checks -bootstrap-expect=1 -log-level=DEBUG

To provide a realistic scenario for testing checks, you can use the counting-service and dashboard-service demo binaries. Download them and run them with the following configuration:

$ PORT=5000 ./counting-service &
$ PORT=5001 COUNTING_SERVICE_URL="http://localhost:5000" ./dashboard-service &

How to register a check

There are three steps for registering a check in Consul.

  1. Define monitoring scope: Decide if you want the check to monitor a service or a node.
  2. Write check definition: Select the type of check you want to register and write the definition.
  3. Register the check: Register the check using one of the available methods.

In this guide, you will complete all three steps.

Define monitoring scope

Before writing the check definition, it is best practice to define the monitoring scope.

Monitor a service

Checks can be registered in association with a service either by embedding the check definition inside the service definition or by associating them with a service using the ServiceID parameter. When registered in association with a service, the check only affects the health of that service. For example, if a check is associated with a service called database, its failure only affects the availability of the database service. All other services provided by the node remain unchanged. This is the right approach for monitoring the health of a service and making sure the DNS interface only returns services that are up and running. You will use this method in the guide.

Monitor an external service

The steps provided by this guide are for services that run on a node with a local Consul agent. If you want to monitor a service that runs on a node where you cannot run a local Consul agent, you can follow the steps provided in External Services. Once you are familiar with the steps to register an external service, you can apply the concepts in this guide to define checks for your external services too.

Monitor a node

When a check is not associated with a service, it monitors the health of the whole node. This is not a common configuration, but it is useful when you want to ensure that the node is not used to serve traffic if some basic health requirements are not met. One possible scenario is to set up a check for hardware resources (RAM, CPU usage, or disk space) and mark the node unhealthy until those parameters are back below the desired threshold.

Write check definition

Consul provides a wide range of options when it comes to health checks; review the full list of available checks in the Consul documentation. In this guide you'll get an overview of the most common ones.

  • Script + Interval
  • HTTP + Interval
  • TCP + Interval
  • Alias

Write a script + interval check

Often, especially when migrating legacy applications to the cloud, you already have some customized scripts that monitor your machines to ensure they're healthy. Script checks allow you to re-use those scripts with Consul. Another use case for script checks is when you want to perform more complex checks that might rely on the underlying OS.

Add the following script check definition to a Consul agent.

{
  "check": {
    "id": "mem-util",
    "name": "Memory utilization",
    "args": [
      "/bin/sh",
      "-c",
      "/usr/bin/free | awk '/Mem/{printf($3/$2*100)}' | awk '{ print($0); if($1 > 70) exit 1;}'"
    ],
    "interval": "10s",
    "timeout": "1s"
  }
}

This script measures the memory usage of a Linux machine and returns a warning state if it rises above 70%.
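To make the arithmetic in the shell pipeline easier to follow, here is a rough Python sketch of the same calculation. It is not part of the Consul check itself: the function name memory_check and the SAMPLE text are hypothetical, and the sketch assumes the output of free has the total in the second column and the used amount in the third column of the Mem line, as the awk expression above does.

```python
def memory_check(free_output: str, threshold: float = 70.0):
    """Return (used_percent, exit_code) computed from the text printed by `free`."""
    for line in free_output.splitlines():
        # Same selection as awk '/Mem/': the first line containing "Mem".
        if "Mem" in line:
            fields = line.split()
            total, used = float(fields[1]), float(fields[2])
            percent = used / total * 100
            # Exit code 1 maps to a warning state, 0 to passing.
            return percent, (1 if percent > threshold else 0)
    raise ValueError("no Mem line found in free output")


# Hypothetical sample of `free` output, used only for illustration.
SAMPLE = """\
              total        used        free
Mem:        8000000     2000000     6000000
Swap:       2000000           0     2000000
"""

print(memory_check(SAMPLE))  # (25.0, 0)
```

With the sample above, a quarter of the memory is in use, so the sketch reports exit code 0; any value above the 70% threshold would produce exit code 1.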

Tuning scripts to be compatible with Consul

Consul doesn't put limitations on the operations the scripts can perform but it uses a convention on the script exit code to decide the status of the script.

  • 0 - Passing
  • 1 - Warning
  • anything else - Failing/Critical

Read more on this in the check scripts documentation.
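The exit code convention above can be sketched in a few lines of Python. This is only an illustration of the mapping, not Consul's implementation: the function names are hypothetical, and treating a timed-out script as critical is an assumption made for this sketch.

```python
import subprocess
import sys


def status_from_exit_code(code: int) -> str:
    """Map a script exit code to a check status using the convention above."""
    if code == 0:
        return "passing"
    if code == 1:
        return "warning"
    return "critical"


def run_script_check(args, timeout: float = 1.0) -> str:
    """Run a command and translate its exit code into a check status."""
    try:
        result = subprocess.run(args, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption for this sketch: a script that times out counts as critical.
        return "critical"
    return status_from_exit_code(result.returncode)


if __name__ == "__main__":
    # A stand-in "script" that exits with code 1.
    print(run_script_check([sys.executable, "-c", "raise SystemExit(1)"], timeout=10))
```

When you adapt an existing monitoring script, the main thing to verify is that its exit codes follow this convention, since any code other than 0 or 1 is treated as a failure.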

Enabling scripts on your Consul agent

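Script checks must be enabled in the agent's configuration before the agent is allowed to execute scripts locally. The agent in the prerequisites was started with the -enable-local-script-checks flag for this reason.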

Write an HTTP + interval check

HTTP checks are a good fit when the service you want to monitor provides an endpoint that reports its state. The status of the service depends on the HTTP response code: any 2xx code is considered passing, a 429 Too Many Requests is a warning, and anything else is a failure. Prefer this type of check over a script that uses curl or another external process to perform a simple HTTP operation. The Dashboard service you configured in the prerequisites provides a /health endpoint that is a good target for these checks.
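As a quick illustration of the response code rule described above, the mapping could be written as follows. The function name status_from_http_code is hypothetical, and the sketch uses "critical" as the name of the failure state.

```python
def status_from_http_code(code: int) -> str:
    """Map an HTTP response code to a check status, following the rule above."""
    if 200 <= code <= 299:
        return "passing"
    if code == 429:
        return "warning"
    return "critical"


for code in (200, 204, 429, 404, 500):
    print(code, status_from_http_code(code))
```

Note that redirects and client errors other than 429 fall into the failure case, so a /health endpoint should answer with a 2xx code directly when the service is healthy.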

In the registration section you'll learn how to embed the following check definition inside a service definition in order to be able to use it with the HTTP API.

{
  "check": {
    "id": "dashboard_check",
    "name": "Check Dashboard health 5001",
    "service_id": "dashboard_1",
    "http": "http://localhost:5001/health",
    "method": "GET",
    "interval": "10s",
    "timeout": "1s"
  }
}

This check will make an HTTP GET request to the URL specified in the http field, waiting the specified interval amount of time between requests.

If you want to keep the check definition in a standalone file (that is, not embedded in the service definition), specify service_id so that the check is associated with the correct service.

Write a TCP + interval check

Not all applications expose an HTTP endpoint to be monitored using an HTTP check. For these applications the best approach is to use TCP checks.

Once a TCP check is configured Consul will attempt to connect to the specified port, and address if specified, and will define the service health based on the connection attempt:

  • if the connection is accepted, the status is success.
  • otherwise the status is critical.

The Counting service you configured in the prerequisites is a good candidate for this check.

In the registration section you'll learn how to embed the check definition inside a service definition in order to be able to use it with the HTTP API.

{
  "check": {
    "id": "counting_check",
    "name": "Check Counter health 5000",
    "service_id": "counting_1",
    "tcp": "localhost:5000",
    "interval": "10s",
    "timeout": "1s"
  }
}

This check makes a TCP connection attempt to the IP/hostname and port specified in the tcp field, waiting interval amount of time between attempts.
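The connection attempt behind a TCP check can be approximated with a short Python sketch. It only illustrates the idea and is not how Consul implements the check; the function name tcp_check is hypothetical, and the sketch reports "passing" for an accepted connection.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 1.0) -> str:
    """Try to open a TCP connection and report a check status."""
    try:
        # The connection is closed again as soon as it has been accepted.
        with socket.create_connection((host, port), timeout=timeout):
            return "passing"
    except OSError:
        # Refused connections, timeouts, and resolution errors all land here.
        return "critical"


if __name__ == "__main__":
    # Assumes the counting service from the prerequisites listens on port 5000.
    print(tcp_check("localhost", 5000))
```

Because the check only verifies that the port accepts connections, it cannot tell whether the application behind the port is answering correctly; use an HTTP check when such an endpoint is available.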

If you want to keep the check definition in a standalone file (that is, not embedded in the service definition), specify service_id so that the check is associated with the correct service.

Challenge: Write an alias check

Sometimes a service can be healthy while one or more of its dependencies are not. This can result in requests being sent to a service that, in the best case, would not answer, but could also respond with unpredictable content. One example is a two-tier application with a frontend and a backend, where the backend is the dependency.

To avoid this scenario, one option is to add an additional check that monitors the backend service and will be associated with the frontend service. However, this can generate additional load or network traffic to check a service that is already monitored.

Consul provides an elegant solution to that by defining an alias check. An alias check aliases the health state of another registered node or service.

For aliased services on the same agent, the local state is monitored and no additional network resources are consumed. For other services and nodes, the check maintains a blocking query over the agent's connection with a current server and allows stale requests.

The counting service could be used to represent the backend and the dependency for the dashboard service. With an alias check, if the counting service fails, then the dashboard will also be marked as unhealthy and will not be returned by the DNS interface.

{
  "check": {
    "id": "counter-alias",
    "name": "counter_alias",
    "service_id": "dashboard_1",
    "alias_service": "counting_1"
  }
}
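One way to picture what the alias check does is to model it as "take the worst status among the checks of the aliased service". The Python sketch below is a simplified model under that assumption, not Consul's actual logic: the names SEVERITY and alias_status are hypothetical, node-level checks are ignored, and reporting "passing" when the aliased service has no checks is an assumption made for the sketch.

```python
# Ordering of statuses from healthiest to least healthy.
SEVERITY = {"passing": 0, "warning": 1, "critical": 2}


def alias_status(checks, aliased_service_id: str) -> str:
    """Return the worst status among the checks of the aliased service."""
    statuses = [c["Status"] for c in checks if c["ServiceID"] == aliased_service_id]
    # Assumption for this sketch: no checks at all is reported as passing.
    return max(statuses, key=SEVERITY.get, default="passing")


checks = [
    {"CheckID": "counting_check", "ServiceID": "counting_1", "Status": "critical"},
    {"CheckID": "dashboard_check", "ServiceID": "dashboard_1", "Status": "passing"},
]

# The dashboard's own check passes, but its alias of counting_1 does not.
print(alias_status(checks, "counting_1"))  # critical
```

In this model the dashboard service ends up with one passing check and one critical check, which is why it would be treated as unhealthy when the counting service fails.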

Register the checks

The final step is to register your checks. You will manually register the checks to gain a better understanding of the process and the information that your automation tooling will ultimately need to provide Consul in order to take better advantage of service discovery.

Register a node check using the configuration directory

Checks are part of Consul's reloadable configuration, so you do not need to restart Consul in order to register or modify a check.

To apply the configuration you can follow these steps:

  • Copy the configuration file inside Consul config-dir

    $ cp ./check_definition.json ./consul-conf
    
  • Apply the configuration by issuing consul reload

    $ consul reload
    Configuration reload triggered
    

Check persistence

Checks installed using this method are not persisted in the Consul data folder. To remove the check, delete the check definition file; to change it, edit the file. In both cases, run consul reload afterwards.

Register the counting service and check using the CLI

Consul CLI provides a command to register a service in the catalog using the same definition structure we used during the check creation.

Write the following definition inside a file called service_counting.json.

{
  "service": {
    "ID": "counting_1",
    "name": "counting",
    "tags": ["backend", "counting"],
    "port": 5000,
    "check": {
      "id": "counting_check",
      "name": "Check Counter health 5000",
      "tcp": "localhost:5000",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}

Once the definition is saved you can register the service by running:

$ ./consul services register service_counting.json
Registered service: counting

Deregister the service

In case you need to deregister a service, and the associated check, registered using the CLI you can use the following command:

$ ./consul services deregister service_counting.json
Deregistered service: counting_1

Register the dashboard service and check using the API

The third option to register a service and a check is via the HTTP API.

Write the following definition inside a file called service_dashboard.json.

{
  "ID": "dashboard_1",
  "name": "dashboard",
  "tags": ["frontend", "counting"],
  "port": 5001,
  "check": {
    "id": "dashboard_check",
    "name": "Check Dashboard health 5001",
    "http": "http://localhost:5001/health",
    "method": "GET",
    "interval": "10s",
    "timeout": "1s"
  }
}

Once the definition is saved you can register the service by running:

$ curl --request PUT --data @service_dashboard.json http://127.0.0.1:8500/v1/agent/service/register
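If your automation tooling is not shell based, the same request can be built with Python's standard library. The sketch below is a minimal example that mirrors the curl command: the function name build_register_request is hypothetical, the Content-Type header is an addition for clarity, and the agent address is assumed to be the default local one used in this guide.

```python
import json
import urllib.request

# Same service definition as the one saved for the dashboard service.
DASHBOARD = {
    "ID": "dashboard_1",
    "name": "dashboard",
    "tags": ["frontend", "counting"],
    "port": 5001,
    "check": {
        "id": "dashboard_check",
        "name": "Check Dashboard health 5001",
        "http": "http://localhost:5001/health",
        "method": "GET",
        "interval": "10s",
        "timeout": "1s",
    },
}


def build_register_request(definition: dict, base_url: str = "http://127.0.0.1:8500"):
    """Build (but do not send) the PUT request that registers a service."""
    return urllib.request.Request(
        url=f"{base_url}/v1/agent/service/register",
        data=json.dumps(definition).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_register_request(DASHBOARD)
    # Sending the request requires a running agent at the address above.
    with urllib.request.urlopen(request) as response:
        print(response.status)
```

Keeping the request construction separate from the call that sends it makes the payload easy to inspect before it reaches the agent.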

Deregister the service

In case you need to deregister a service registered using the API, and the associated check, you can use the following command:

$ curl --request PUT http://127.0.0.1:8500/v1/agent/service/deregister/dashboard_1

Troubleshooting Checks

At this point you should be all set: you registered your checks and hopefully they are healthy. However, as you have probably already experienced, reality is much more variable. Here are a few methods for monitoring checks.

Consul UI

The first way to check on the state of your services is to use the Consul UI. Consul (if configured using the ui parameter) exposes a web interface, by default on port 8500 of the node it is running on. Here is one example of how the UI looks after all services and checks are registered:

[Figure: Consul UI with three services, all healthy, and all checks registered]

In case something is going on with your services and the checks start failing the view is going to be less reassuring:

[Figure: Consul UI with three services and all checks registered, showing warnings and failures]

You can click on the different icons to discover which checks are failing and see if the output provides additional information.

Logs

Another place that can help you see what is going on with your checks is the log files. Here is what the logs show for a check called mem-util in each of its states:

  • Passing:

    [DEBUG] agent: Check "mem-util" is passing
    

    You will need to have log-level set to DEBUG or TRACE to see this line in your logs.

  • Warning:

    [WARN] agent: Check "mem-util" is now warning
    
  • Critical:

    [WARN] agent: Check "mem-util" is now critical
    

REST API

Neither of the indicators above is the most accurate way to check the state of your services, and neither is easy to automate. If you want a more detailed set of information, you can use the REST API:

$ curl --request GET http://127.0.0.1:8500/v1/agent/checks
{
  "counter-alias": {
    "Node": "agent-dc1-1",
    "CheckID": "counter-alias",
    "Name": "counter_alias",
    "Status": "passing",
    "Notes": "",
    "Output": "All checks passing.",
    "ServiceID": "dashboard_1",
    "ServiceName": "dashboard",
    "ServiceTags": [
      "frontend",
      "counting"
    ],
    ...
  },
  "counting_check": {
    "Node": "agent-dc1-1",
    "CheckID": "counting_check",
    "Name": "Check Counter health 5000",
    "Status": "passing",
    "Notes": "",
    "Output": "TCP connect localhost:5000: Success",
    "ServiceID": "counting_1",
    "ServiceName": "counting",
    "ServiceTags": [
      "backend",
      "counting"
    ],
    ...
  },
  "dashboard_check": {
    "Node": "agent-dc1-1",
    "CheckID": "dashboard_check",
    "Name": "Check Dashboard health 5001",
    "Status": "passing",
    "Notes": "",
    "Output": "HTTP GET http://localhost:5001/health: 200 OK Output: Hello, you've hit /health\n",
    "ServiceID": "dashboard_1",
    "ServiceName": "dashboard",
    "ServiceTags": [
      "frontend",
      "counting"
    ],
    ...
  },
  "mem-util": {
    "Node": "agent-dc1-1",
    "CheckID": "mem-util",
    "Name": "Memory utilization",
    "Status": "passing",
    "Notes": "",
    "Output": "22.449\n",
    "ServiceID": "",
    "ServiceName": "",
    "ServiceTags": [],
    "Type": "",
    "Definition": {},
    "CreateIndex": 0,
    "ModifyIndex": 0
  }
}

If you want to filter the results, you can use the filtering options described in the documentation for the REST endpoint.
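The response can also be filtered on the client side once it has been parsed. The sketch below is a small Python example of that approach; the function name failing_checks is hypothetical, and the input is assumed to be the JSON object returned by /v1/agent/checks, keyed by check ID as in the output above.

```python
import json


def failing_checks(checks: dict) -> dict:
    """Return {check_id: status} for every check that is not passing."""
    return {
        check_id: check["Status"]
        for check_id, check in checks.items()
        if check["Status"] != "passing"
    }


# Shortened, hypothetical response used only for illustration.
RESPONSE = json.loads("""
{
  "counting_check": {"CheckID": "counting_check", "Status": "critical", "ServiceID": "counting_1"},
  "dashboard_check": {"CheckID": "dashboard_check", "Status": "passing", "ServiceID": "dashboard_1"},
  "mem-util": {"CheckID": "mem-util", "Status": "warning", "ServiceID": ""}
}
""")

print(failing_checks(RESPONSE))  # {'counting_check': 'critical', 'mem-util': 'warning'}
```

An empty result means every registered check is passing, which makes the function easy to use as a gate in automation.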

Different endpoints in the API

Consul exposes several API resources for interacting with services and checks. This guide used /v1/agent/service/register and /v1/agent/service/deregister for registration, and /v1/agent/checks for monitoring.

Summary

In this guide you registered health checks with Consul and learned how to monitor them using the resources Consul natively provides: the UI, the logs, and the HTTP API. You can find a complete list of health check registration fields in the API documentation, or learn more about health checks in the check definition documentation.