Day 2: Advanced Operations and Maintenance

Consul Cluster Monitoring and Metrics

After setting up your first datacenter, it is an ideal time to make sure your cluster is healthy and establish a baseline. This guide will cover several types of metrics in two sections: Consul health and server health.

Consul health:

  • Transaction timing
  • Leadership changes
  • Autopilot
  • Garbage collection

Server health:

  • File descriptors
  • CPU usage
  • Network activity
  • Disk activity
  • Memory usage

For each type of metric, we will review their importance and help identify when a metric is indicating a healthy or unhealthy state.

First, we need to understand the three methods for collecting metrics. We will briefly cover using SIGUSR1, the HTTP API, and telemetry.

Before starting this guide, we recommend configuring ACLs.

How to Collect Metrics

There are three methods for collecting metrics. The first, and simplest, is to use SIGUSR1 for a one-time dump of current telemetry values. The second method is to get a similar one-time dump using the HTTP API. The third method, and the one most commonly used for long-term monitoring, is to enable telemetry in the Consul configuration file.

SIGUSR1 for Local Use

To get a one-time dump of current metric values, we can send the SIGUSR1 signal to the Consul process.

$ kill -USR1 <process_id>

This will send the output to the system logs, such as /var/log/messages or to journald. If you are monitoring the Consul process in the terminal via consul monitor, you will see the metrics in the output.

Although this is the easiest way to get a quick read of a single Consul agent's health, it is much more useful to look at how the values change over time.

API GET Request

Next let's use the HTTP API to quickly collect metrics with curl.

$ curl http://127.0.0.1:8500/v1/agent/metrics

In production you will want to set up credentials with an ACL token and enable TLS for secure communications. Once ACLs have been configured, you can pass a token with the request.

$ curl \
    --header "X-Consul-Token: <YOUR_ACL_TOKEN>" \
    https://127.0.0.1:8500/v1/agent/metrics

In addition to being a good way to quickly collect metrics, it can be added to a script or it can be used with monitoring agents that support HTTP scraping, such as Prometheus, to visualize the data.

Enable Telemetry

Finally, Consul can be configured to send telemetry data to a remote monitoring system. This allows you to monitor the health of agents over time, spot trends, and plan for future needs. You will need a monitoring agent and console for this.

Consul supports the following telemetry agents:

  • Circonus
  • DataDog (via dogstatsd)
  • StatsD (via statsd, statsite, telegraf, etc.)

If you are using StatsD, you will also need a compatible database and server, such as Grafana, Chronograf, or Prometheus.

Telemetry can be enabled in the agent configuration file, for example server.hcl. Telemetry can be enabled on any agent, client or server. Normally, you would at least enable it on all the servers (both voting and non-voting) to monitor the health of the entire cluster.

An example snippet of server.hcl to send telemetry to DataDog looks like this:

  "telemetry": {
    "dogstatsd_addr": "localhost:8125",
    "disable_hostname": true
  }

When enabling telemetry on an existing cluster, the Consul process will need to be reloaded. This can be done with consul reload or kill -HUP <process_id>. It is recommended to reload the servers one at a time, starting with the non-leaders.

Consul Health

The Consul health metrics reveal information about the Consul cluster. They include performance metrics for the key value store, transactions, raft, leadership changes, autopilot tuning, and garbage collection.

Transaction Timing

The following metrics indicate how long it takes to complete write operations in various parts, including Consul KV and Raft from the Consul server. Generally, these values should remain reasonably consistent and no more than a few milliseconds each.

Metric NameDescription
consul.kvs.applyMeasures the time it takes to complete an update to the KV store.
consul.txn.applyMeasures the time spent applying a transaction operation.
consul.raft.applyCounts the number of Raft transactions occurring over the interval.
consul.raft.commitTimeMeasures the time it takes to commit a new entry to the Raft log on the leader.

Sudden changes in any of the timing values could be due to unexpected load on the Consul servers or due to problems on the hosts themselves. Specifically, if any of these metrics deviate more than 50% from the baseline over the previous hour, this indicates an issue. Below are examples of healthy transaction metrics.

'consul.raft.apply': Count: 1 Sum: 1.000 LastUpdated: 2018-11-16 10:55:03.673805766 -0600 CST m=+97598.238246167
'consul.raft.commitTime': Count: 1 Sum: 0.017 LastUpdated: 2018-11-16 10:55:03.673840104 -0600 CST m=+97598.238280505

Leadership Changes

In a healthy environment, your Consul cluster should have a stable leader. There shouldn't be any leadership changes unless you manually change leadership (by taking a server out of the cluster, for example). If there are unexpected elections or leadership changes, you should investigate possible network issues between the Consul servers. Another possible cause could be that the Consul servers are unable to keep up with the transaction load.

Note: These metrics are reported by the follower nodes, not by the leader.

Metric NameDescription
consul.raft.leader.lastContactMeasures the time since the leader was last able to contact the follower nodes when checking its leader lease.
consul.raft.state.candidateIncrements when a Consul server starts an election process.
consul.raft.state.leaderIncrements when a Consul server becomes a leader.

If the candidate or leader metrics are greater than 0 or the lastContact metric is greater than 200ms, you should look into one of the possible causes described above. Below are examples of healthy leadership metrics.

'consul.raft.leader.lastContact': Count: 4 Min: 10.000 Mean: 31.000 Max: 50.000 Stddev: 17.088 Sum: 124.000 LastUpdated: 2018-12-17 22:06:08.872973122 +0000 UTC m=+3553.639379498
'consul.raft.state.leader': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.104580236 +0000 UTC m=+3533.870986584
'consul.raft.state.candidate': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.097186444 +0000 UTC m=+3533.863592815

Autopilot

The autopilot metric is a boolean. A value of 1 indicates a healthy cluster and 0 indicates an unhealthy state.

Metric NameDescription
consul.autopilot.healthyTracks the overall health of the local server cluster. If all servers are considered healthy by autopilot, this will be set to 1. If any are unhealthy, this will be 0.

An alert should be setup for a returned value of 0. Below is an example of a healthy cluster according to the autopilot metric.

[2018-12-17 13:03:40 -0500 EST][G] 'consul.autopilot.healthy': 1.000

Garbage Collection

Garbage collection (GC) pauses are a "stop-the-world" event, all runtime threads are blocked until GC completes. In a healthy environment these pauses should only last a few nanoseconds. If memory usage is high, the Go runtime may start the GC process so frequently that it will slow down Consul. You might observe more frequent leader elections or longer write times.

Metric NameDescription
consul.runtime.total_gc_pause_nsNumber of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started.

If the value return is more than 2 seconds/minute, you should start investigating the cause. If it exceeds 5 seconds per minute, you should consider the cluster to be in a critical state and start ensuring failure recovery procedures are up-to-date and start investigating. Below is an example of healthy GC pause.

'consul.runtime.total_gc_pause_ns': 136603664.000

Note, total_gc_pause_ns is a cumulative counter, so in order to calculate rates, such as GC/minute, you will need to apply a function such as non_negative_difference.

Server Health

The server metrics provide information about the health of your cluster including file handles, CPU usage, network activity, disk activity, and memory usage.

File Descriptors

The majority of Consul operations require a file descriptor handle, including receiving a connection from another host, sending data between servers, and writing snapshots to disk. If Consul runs out of handles, it will stop accepting connections.

Metric NameDescription
linux_sysctl_fs.file-nrNumber of file handles being used across all processes on the host.
linux_sysctl_fs.file-maxTotal number of available file handles.

By default, process and kernel limits are conservative, you may want to increase the limits beyond the defaults. If the linux_sysctl_fs.file-nr value exceeds 80% of linux_sysctl_fs.file-max, the file handles should be increased. Below is an example of a file handle metric.

linux_sysctl_fs, host=statsbox, file-nr=768i, file-max=96763i

CPU Usage

Consul should not be demanding of CPU time on either server or clients. A spike in CPU usage could indicate too many operations taking place at once.

Metric NameDescription
cpu.user_cpuPercentage of CPU being used by user processes (such as Vault or Consul).
cpu.iowait_cpuPercentage of CPU time spent waiting for I/O tasks to complete.

If cpu.iowait_cpu is greater than 10%, it should be considered critical as Consul is waiting for data to be written to disk. This could be a sign that Raft is writing snapshots to disk too often. Below is an example of a healthy CPU metric.

cpu, cpu=cpu-total, usage_idle=99.298, usage_user=0.400, usage_system=0.300, usage_iowait=0, usage_steal=0

Network Activity

Network activity should be consistent. A sudden spike in network traffic to Consul might be the result of a misconfigured client, such as Vault, that is causing too many requests.

Most agents will report separate metrics for each network interface, so be sure you are monitoring the right one.

Metric NameDescription
net.bytes_recvBytes received on each network interface.
net.bytes_sentBytes transmitted on each network interface.

Sudden increases to the net metrics, greater than 50% deviation from baseline, indicates too many requests that are not being handled. Below is an example of a network activity metric.

net, interface=enp0s5, bytes_sent=6183357i, bytes_recv=262313256i

Note: The net metrics are counters, so in order to calculate rates, such as bytes/second, you will need to apply a function such as non_negative_difference.

Disk Activity

Normally, there is low disk activity, because Consul keeps everything in memory. If the Consul host is writing a large amount of data to disk, it could mean that Consul is under heavy write load and consequently is checkpointing Raft snapshots to disk frequently. It could also mean that debug/trace logging has accidentally been enabled in production, which can impact performance.

Metric NameDescription
diskio.read_bytesBytes read from each block device.
diskio.write_bytesBytes written to each block device.
diskio.read_timeTime spent reading from disk, in cumulative milliseconds.
diskio.write_timeTime spent writing to disk, in cumulative milliseconds.

Sudden, large changes to the diskio metrics, greater than 50% deviation from baseline or more than 3 standard deviations from baseline indicates Consul has too much disk I/O. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete. Below are examples of disk activity metrics.

diskio, name=sda5, read_bytes=522298368i,  write_bytes=1726865408i, read_time=7248i, write_time=133364i

Note: The diskio metrics are counters, so in order to calculate rates (such as bytes/second),you will need to apply a function such as [non_negative_difference][].

Memory Usage

As noted previously, Consul keeps all of its data -- the KV store, the catalog, etc -- in memory. If Consul consumes all available memory, it will crash. You should monitor total available RAM to make sure some RAM is available for other system processes and swap usage should remain at 0% for best performance.

Metric NameDescription
consul.runtime.alloc_bytesMeasures the number of bytes allocated by the Consul process.
consul.runtime.sys_bytesThe total number of bytes of memory obtained from the OS.
mem.totalTotal amount of physical memory (RAM) available on the server.
mem.used_percentPercentage of physical memory in use.
swap.used_percentPercentage of swap space in use.

Consul servers are running low on memory if sys_bytes exceeds 90% of total_bytes, mem.used_percent is over 90%, or swap.used_percent is greater than 0. You should increase the memory available to Consul if any of these three conditions are met. Below are examples of memory usage metrics.

'consul.runtime.alloc_bytes': 11199928.000
'consul.runtime.sys_bytes': 24627448.000
mem,  used_percent=31.492,  total=1036312576i
swap, used_percent=1.343

Summary

In this guide we reviewed the three methods for collecting metrics. SIGUSR1 and agent HTTP API are both quick methods for collecting metrics, but enabling telemetry is the best method for moving data into monitoring software. Additionally, we outlined the various metrics collected and their significance.