Day 2: Advanced Operations and Maintenance

Troubleshooting

Troubleshooting is a fundamental skill for any devops practitioner. With that in mind, Consul includes several tools to help you view log messages, validate configuration files, examine the service catalog, and do other debugging.

» Troubleshooting Steps

Before you start, consider following a pre-determined troubleshooting workflow, which can keep distractions from interfering with finding and fixing the problem. You or your team might already use a workflow such as Observe, Orient, Decide, Act or Rubber duck debugging. We like to use the following process:

  1. Gather data
  2. Verify what works
  3. Solve one problem at a time
  4. Form a hypothesis and test it
  5. (Repeat)

Consul and other tools generate log files, status messages, process lists, data feeds, and other useful information, which you can use when gathering data. We'll go over many of those tools in this guide.

Verifying what works is important because Consul sits at the center of other highly complex systems. Making sure that core systems, networking, and services are working as expected can help you narrow down the problem space, and prevent you from spending time troubleshooting the wrong issues. If you are uncertain about how a component works or if it is working at all, list that uncertainty in your notes or take the time to verify that it is operating normally. In this guide we'll discuss some tools that can help you check on outside systems, but you should always consult their documentation as well.

Because Consul is highly configurable, you'll find it easier if you solve one problem at a time, verify successful operation, and then proceed to the next issue. If necessary, build a smaller system where you can test the specific configuration options or features that seem to be operating incorrectly. Once you have verified proper syntax, correct network operation, and fully functioning microservices, then you can integrate those changes back into the main system.
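
For example, one quick way to build that smaller test system is to run a throwaway agent in development mode with only the configuration you are investigating (a minimal sketch; the file path is an assumption):

$ consul agent -dev -config-file=/tmp/counting-test.json

The -dev flag starts a single in-memory agent, so you can experiment with a suspect configuration and discard the state when you are done.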

Hypothesis-based testing can help you focus on the one problem you've chosen. Write down a hypothesis (theory) about what might be causing your problem and how it might be fixed. Then observe the data and take action to see if your theory is correct (and if it fixes the problem).

You may need to repeat some or all of this process to isolate each piece of the problem. By following a consistent process (no matter what it is), you can reduce the likelihood that your process will complicate the situation.

» Consul-Specific Tools

Now that you have a troubleshooting process, let's examine the tools and data available to you while operating Consul. These Consul-specific tools will help you gather data and verify what is already working.

» Review Your Consul Architecture

When communicating with other members of your team or with support staff, it's helpful to review some details about your Consul architecture, such as the following (the sketch after this list shows one quick way to gather some of them):

  • How are you querying Consul? (DNS, HTTP)
  • Is the Consul web UI available for viewing?
  • What system are you using for launching microservices? (systemd, kubernetes, upstart, Nomad, etc.)
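
One way to collect several of these details at once is to ask the local agent directly. This is a minimal sketch that assumes a default local agent on 127.0.0.1 with the standard HTTP and DNS ports:

$ consul info                                     # agent, raft, and serf details
$ curl http://127.0.0.1:8500/ui/                  # does the web UI respond?
$ dig @127.0.0.1 -p 8600 consul.service.consul    # does the DNS interface answer?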

» Members

The consul members command lists the servers and agents that are part of the Consul cluster, as seen by the agent you are connected to.

$ consul members

Node    Address         Status  Type    Build  Protocol  DC   Segment
laptop  127.0.0.1:8301  alive   server  1.4.0  2         dc1  <all>
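
If you operate more than one datacenter, the -wan flag (run against a server agent) lists the servers joined over the WAN instead of the local LAN members:

$ consul members -wan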

» List Peers

If you need details other than those provided by members, try the subcommands of consul operator raft. These operate at a lower level but provide detail about the moment-to-moment status of the cluster. The list-peers subcommand shows each server's state (leader or follower), voting status, and Raft protocol version.

$ consul operator raft list-peers

Node    ID           Address         State   Voter  RaftProtocol
laptop  abc-def-g12  127.0.0.1:8300  leader  true   3
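
Related information is also available over the HTTP API if the CLI is not handy (assuming the default API address):

$ curl http://127.0.0.1:8500/v1/status/leader    # address of the current raft leader
$ curl http://127.0.0.1:8500/v1/status/peers     # addresses of the current raft peers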

» Monitor

The consul monitor command streams log output from the Consul agent. Arguments such as -log-level=debug or -log-level=trace increase the amount of information shown.

$ consul monitor

2019/01/25 17:48:33 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:abcdef Address:127.0.0.1:8300}]
2019/01/25 17:48:33 [INFO] raft: Node at 127.0.0.1:8300 [Follower] entering Follower state (Leader: "")
2019/01/25 17:48:33 [INFO] serf: EventMemberJoin: laptop.dc1 127.0.0.1
2019/01/25 17:48:33 [INFO] serf: EventMemberJoin: laptop 127.0.0.1
2019/01/25 17:48:33 [INFO] consul: Adding LAN server laptop (Addr: tcp/127.0.0.1:8300) (DC: dc1)
2019/01/25 17:48:33 [INFO] consul: Handled member-join event for server "laptop.dc1" in area "wan"
2019/01/25 17:48:33 [ERR] agent: Failed decoding service file "services/.DS_Store": invalid character '\x00' looking for beginning of value
2019/01/25 17:48:33 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/01/25 17:48:33 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/01/25 17:48:33 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/01/25 17:48:33 [INFO] agent: Started gRPC server on 127.0.0.1:8502 (tcp)
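
Because debug and trace output can be overwhelming, it is often useful to raise the log level and filter for the subsystem you care about (a sketch; the grep pattern is an assumption):

$ consul monitor -log-level=debug | grep -i raft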

» Validate

The consul validate command can be run on a single Consul configuration file or, more commonly, on an entire directory of configuration files. Basic syntax and logical correctness will be analyzed and reported upon.

This command will catch misspellings of Consul configuration keys, or the absence or misconfiguration of crucial attributes.

$ consul validate /etc/consul.d/counting.json

* invalid config key serrrvice
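
To check everything the agent will load at startup, point validate at the whole configuration directory instead of a single file (assuming the conventional /etc/consul.d path):

$ consul validate /etc/consul.d/

Configuration is valid!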

» Debug

The consul debug command can be run on a node without any other arguments. For two minutes it captures metrics, logs, profiling data, and other diagnostic information into an archive in the current directory.

All content is written in plain text to a compressed archive, so do not transmit the emitted data over unencrypted channels.

You might find this data useful yourself, or you can provide it to support staff to help debug your Consul cluster.

$ consul debug

==> Starting debugger and capturing static information...
     Agent Version: '1.4.0'
          Interval: '30s'
          Duration: '2m0s'
            Output: 'consul-debug-1548721978.tar.gz'
           Capture: 'metrics, logs, pprof, host, agent, cluster'
==> Beginning capture interval 2019-01-28 16:32:58.56142 -0800 PST (0)

Extract the archive and view the contents.

$ tar xvfz consul-debug-1548721978.tar.gz
$ tree consul-debug-1548721978

(tree output showing the captured metrics, logs, pprof, host, agent, and cluster data)
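
If the default two-minute capture is too short, too long, or too broad, the capture window and targets can be tuned (a sketch; the values shown are assumptions):

$ consul debug -duration=5m -interval=1m -capture=metrics -capture=logs -output=./counting-debug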

» Health Checks

When enabled, health checks are a crucial part of the operation of the Consul cluster. Unhealthy services will not be published for discovery via standard DNS or some HTTP API calls.

The easiest way to view initial health status is by visiting the Consul Web UI at http://localhost:8500/ui.

Click through to a specific service such as the counting service. The status of the service on each node will be displayed. Click through to see the output of the health check.

Alternatively, use the HTTP API to inspect services or the health of a specific service. The /v1/agent/services endpoint returns all services registered with the local agent.

$ curl "http://127.0.0.1:8500/v1/agent/services"

Filters can be provided on the query string. For example, this command looks for instances of the counting service that are passing (healthy):

$ curl 'http://localhost:8500/v1/health/service/counting?passing'

[
  {
    "Node": {
      "ID": "da8eb9d3-...",
      "Node": "laptop",
      "Address": "127.0.0.1",
      "Datacenter": "dc1",
    },
    "Checks": [
      {
        "Node": "laptop",
        "Output": "Agent alive and reachable"
      },
      {
        "Node": "laptop",
        "Name": "Service 'counting' check",
        "Status": "passing",
        "Output": "HTTP GET http://localhost:9003/health: 200 OK Output: Hello, you've hit /health\n",
        "ServiceName": "counting"
      }
    ]
  }
]

Another important aspect of health checks is not only the presence of a healthy service, but its continued presence over time. If a service oscillates between healthy and unhealthy, we call that flapping. Flapping is often due to a problem inherent to the service itself (such as an internal crash), so investigate the service or the process launcher that runs it (such as systemd).
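
To catch flapping as it happens, you can watch check transitions for the service over time instead of sampling its status once (a sketch using consul watch against the local agent):

$ consul watch -type=checks -service=counting

When no handler process is given, each change is written to stdout as JSON, so repeated passing/critical transitions stand out in the output.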

Also consider the type of health check being run. Prefer built-in checks such as TCP or HTTP over a script check, which shells out to an external command to verify service health.
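
For comparison, here is a minimal sketch of a service definition that uses a built-in HTTP check against the counting service's /health endpoint (the port and timings are assumptions based on the examples above):

{
  "service": {
    "name": "counting",
    "port": 9003,
    "check": {
      "id": "counting-http",
      "http": "http://localhost:9003/health",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}

An HTTP or TCP check like this is executed by the agent itself, so there is no external script to time out, crash, or leak resources.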

» External Tools

Consul was designed to work within an established ecosystem of networking protocols which means that you can use existing tools to gather data, verify what is working, and debug the network, applications, and security context surrounding Consul.

» ps

Consul service discovery relies on existing process launching tools such as systemd, upstart, or init.d. If you use one of these tools, refer to its documentation to learn how to write configuration files, start/stop/restart scripts, and monitor the output of the processes it launches.

A common tool on Unix systems is ps. Run it to verify that your service processes are running as expected.

$ ps | grep counting

79846 ttys001    0:00.07 ./counting-service

Or, consider pstree, which can be installed separately with your operating system's package manager. It displays running processes as a hierarchy of parents and children.

$ pstree

 | | \-+= 74259 geoffrey -zsh
 | |   \--= 79846 geoffrey ./counting-service

» dig

Given that Consul speaks the DNS protocol, standard DNS tools can be used with it. However, Consul operates by default on port 8600 instead of the DNS default of 53.

The dig tool is a command line application that interacts with DNS records of all kinds. To see details about a DNS record in Consul, pass the -p 8600 flag and the IP address of the Consul server with @127.0.0.1.

In this example, we find the IP address of the counting service.

$ dig @127.0.0.1 -p 8600 counting.service.consul

; <<>> DiG 9.10.6 <<>> @127.0.0.1 -p 8600 counting.service.consul
...

;; ANSWER SECTION:
counting.service.consul. 0  IN  A   192.168.0.35

Consul also provides additional information with the SRV (service) argument. Add SRV to the dig command to see the port number that the counting service operates on (shown as port 9003 below).

$ dig @127.0.0.1 -p 8600 counting.service.consul SRV

;; ANSWER SECTION:
counting.service.consul. 0   IN  SRV 1 1 9003 Machine.local.node.dc1.consul.

;; ADDITIONAL SECTION:
Machine.local.node.dc1.consul. 0 IN  A   192.168.0.35

If you have configured DNS forwarding to integrate system DNS with Consul, you can omit the IP address and port number. However, when debugging it is often useful to be specific so you can verify that direct communication to the Consul agent is working as expected.
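
With forwarding in place, the same name should also resolve through the system resolver without the @server and -p arguments; comparing the two queries tells you whether a failure lies in Consul itself or in the forwarding layer:

$ dig counting.service.consul +short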

» curl

As simple as it may seem, the curl command is extremely useful for debugging any web service or discovering details about content and HTTP headers.

Use curl against your configured health endpoint to verify connectivity to a service. You can target an IP address or localhost, add a port number, and include any path or query string data (you can even post form data).

If you have configured DNS forwarding, you may be able to use Consul-specific domain names to communicate to the service (such as http://counting.service.consul).

$ curl http://localhost:9003/health

Hello, you've hit /health
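
To see the HTTP status line and response headers as well as the body (useful when a health check expects a particular status code), add --include, or -v for full request and response detail:

$ curl --include http://localhost:9003/health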

» ping

The ping command is a simple tool that verifies network connectivity to a host (if it responds to ping). Use an IP address to verify that one node can communicate to another node.

ping is also useful for viewing the latency between nodes and the reliability of packet transfer between them.

$ ping 10.0.1.14

PING 10.0.1.14 (10.0.1.14): 56 data bytes
64 bytes from 10.0.1.14: icmp_seq=0 ttl=64 time=0.065 ms
64 bytes from 10.0.1.14: icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from 10.0.1.14: icmp_seq=2 ttl=64 time=0.060 ms
^C
--- 10.0.1.14 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.060/0.064/0.068/0.003 ms

» Summary

You should now be able to use Consul's built-in tools to help troubleshoot:

  • View log messages with consul monitor
  • Validate configuration files with consul validate
  • Examine the service catalog with the HTTP API endpoint /v1/agent/services

Finally, you should be able to use external tools like dig and curl to help troubleshoot Consul issues.