HashiCorp Diagnostic, hcdiag, is a universal troubleshooting and data-gathering tool for HashiCorp products that helps you collect and archive important data about your Nomad cluster. You can share the information that hcdiag gathers with your team or the HashiCorp support teams during incident response and troubleshooting.
»Prerequisites
»Create a Nomad cluster
In this scenario, you create a Nomad cluster consisting of one node. This node runs the Nomad agent in development mode inside a Docker container and performs both the server and client functions. Development mode is useful for quickly starting a cluster to test configurations or prototype interactions. The hcdiag information in this tutorial can be used to troubleshoot and report on any Nomad cluster.
»Set up the environment
Run an ubuntu Docker container in detached mode with the -d flag. The --rm flag instructs Docker to delete the container once it has been stopped, and the -t flag allocates a pseudo-tty, which keeps the container running until it is stopped manually.
Open an interactive shell into the container with the -it flags.
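The two steps above might look like the following sketch. The container name nomad-hcdiag is a hypothetical choice, not something this tutorial prescribes, and the commands are wrapped in functions so that nothing starts until you call them.

```shell
# Hypothetical container name; any name works if you reuse it consistently.
CONTAINER_NAME="nomad-hcdiag"

start_container() {
  # -d: detached, --rm: delete the container once stopped, -t: allocate a pseudo-tty
  docker run -d --rm -t --name "$CONTAINER_NAME" ubuntu
}

open_shell() {
  # -i keeps stdin open and -t attaches a terminal, giving an interactive shell
  docker exec -it "$CONTAINER_NAME" /bin/bash
}
```

Call start_container first, then open_shell.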
Note: Your terminal prompt changes to show that you are in a shell in the Ubuntu container, for example root@7c267a923930:/#. Run the rest of the commands in this tutorial in this Ubuntu container shell.
Update the apt-get package lists and install the necessary dependencies.
Create a working directory and change into it.
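A minimal sketch of these two steps follows. The package set (wget and unzip) and the working directory name are assumptions chosen to match the download and unzip steps later in the tutorial. The package installation is wrapped in a function because it needs network access and root privileges.

```shell
install_deps() {
  # Assumed package set: wget for downloads, unzip for the release archives.
  apt-get update && apt-get install -y wget unzip
}

# Hypothetical working directory name.
WORK_DIR="${TMPDIR:-/tmp}/nomad-hcdiag"
mkdir -p "$WORK_DIR" && cd "$WORK_DIR"
```

Run install_deps inside the container before continuing.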
»Install and Start Nomad
Create a variable for the Nomad version.
Download the nomad binary.
Unzip the package, move the binary to a directory on the system path, and delete the zip file.
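The version variable, download, and install steps could be combined as in the sketch below. The version number 1.3.0 is only an example, and the URL follows the layout used on releases.hashicorp.com for Linux amd64 builds. The function name install_nomad is illustrative.

```shell
# Assumed version for illustration; substitute the release you want to test.
export NOMAD_VERSION="1.3.0"
NOMAD_ZIP="nomad_${NOMAD_VERSION}_linux_amd64.zip"
NOMAD_URL="https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/${NOMAD_ZIP}"

install_nomad() {
  wget -q "$NOMAD_URL"       # download the release archive
  unzip -o "$NOMAD_ZIP"      # extract the nomad binary
  mv nomad /usr/local/bin/   # place it on the system path
  rm "$NOMAD_ZIP"            # delete the zip file
}
```

Call install_nomad inside the container to perform the download and installation.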
Create an agent configuration file named nomad.hcl and enable ACLs.
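One way to create the file is with a here-document, as sketched below. The configuration contains only the acl block, which is the minimum needed to turn on the ACL system.

```shell
# Write a minimal agent configuration that only enables ACLs.
cat > nomad.hcl <<'EOF'
acl {
  enabled = true
}
EOF
```

When you start the agent in the next step, the file can be passed with the agent's -config flag, for example nomad agent -dev -config=nomad.hcl.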
Run the nomad agent in dev mode as a background process. This may take a few seconds.
Bootstrap the ACLs and save the management token SecretID to a file.
Set the NOMAD_TOKEN variable to use the management token.
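The bootstrap and token steps could be scripted as sketched below. The sketch assumes that nomad acl bootstrap prints a line of the form "Secret ID = <value>". The helper name extract_secret_id and the file name management.token are illustrative.

```shell
# Read bootstrap output on stdin and print only the Secret ID value.
extract_secret_id() {
  awk -F ' *= *' '/^Secret ID/ {print $2}'
}

# Inside the container, with the agent running:
#   nomad acl bootstrap | extract_secret_id > management.token
#   export NOMAD_TOKEN="$(cat management.token)"
```

Keeping the token in a file makes it easy to export it again in a new shell session.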
Test connectivity to the cluster by running the nomad status command.
»Install and Run the hcdiag tool
Create a variable for the hcdiag tool version.
Download the hcdiag binary.
Unzip the package, move the binary to a directory on the system path, and delete the zip file.
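The hcdiag installation follows the same pattern as the Nomad installation. In the sketch below, the version number 0.2.0 is an assumption used for illustration, and the function name install_hcdiag is hypothetical.

```shell
# Assumed version for illustration; check the releases page for current versions.
export HCDIAG_VERSION="0.2.0"
HCDIAG_ZIP="hcdiag_${HCDIAG_VERSION}_linux_amd64.zip"
HCDIAG_URL="https://releases.hashicorp.com/hcdiag/${HCDIAG_VERSION}/${HCDIAG_ZIP}"

install_hcdiag() {
  wget -q "$HCDIAG_URL"       # download the release archive
  unzip -o "$HCDIAG_ZIP"      # extract the hcdiag binary
  mv hcdiag /usr/local/bin/   # place it on the system path
  rm "$HCDIAG_ZIP"            # delete the zip file
}
```

Call install_hcdiag inside the container to perform the download and installation.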
Run hcdiag against the Nomad cluster. This may take a few minutes.
»Production usage tips
By default, the hcdiag tool includes files for up to 72 hours back from the current time. You can specify the desired time range using the -include-since flag.
If you are concerned about impacting the performance of your Nomad servers, you can use the -serial flag so that the seekers run one at a time instead of concurrently.
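As a sketch, the two flags can be combined with the -nomad flag, which selects Nomad diagnostics. The 24h value is only an example duration. The guard around the call is an addition for illustration so that the snippet does nothing on machines without hcdiag.

```shell
# Look back 24 hours instead of the default 72, and run seekers one at a time.
HCDIAG_ARGS="-nomad -include-since 24h -serial"
echo "hcdiag $HCDIAG_ARGS"

# Only invoke the tool when it is installed. $HCDIAG_ARGS is left unquoted
# on purpose so that the shell splits it into separate arguments.
if command -v hcdiag >/dev/null 2>&1; then
  hcdiag $HCDIAG_ARGS
fi
```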
Deploying hcdiag in production involves a workflow similar to the following:
1. Place the hcdiag binary on the Nomad system in scope. This could be a Nomad server or a Nomad client.
2. When running with a configuration file and the -config flag, ensure that the specified configuration file is readable by the user that executes hcdiag.
3. Ensure that the current directory, or the directory specified by the -dest flag, is writable by the user that executes hcdiag.
4. Ensure connectivity to the HashiCorp products that hcdiag needs to connect to during the run. Export any required environment variables for establishing connections or passing authentication tokens as necessary.
5. Decide on a duration for information gathering, noting that the default is to gather up to 72 hours back in server log output. Adjust this as necessary with the -include-since flag. For example, to include only 24 hours of log output, pass -include-since 24h.
6. Limit what is gathered with the -includes flag. For example, -includes /var/log/consul-*,/var/log/nomad-* instructs hcdiag to only gather logs matching the specified Consul and Nomad filename patterns.
7. Use the -dryrun flag to observe what hcdiag will do without anything actually being done, which is useful for testing configuration and options.
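The file permission checks in this workflow can be automated before each run. The following is a minimal sketch; the function name preflight and its messages are hypothetical and not part of hcdiag.

```shell
# preflight CONFIG_FILE DEST_DIR
# Succeeds only if the config file is readable and the destination is writable
# for the user running this shell.
preflight() {
  config_file="$1"
  dest_dir="$2"
  if [ ! -r "$config_file" ]; then
    echo "config file not readable: $config_file" >&2
    return 1
  fi
  if [ ! -w "$dest_dir" ]; then
    echo "destination not writable: $dest_dir" >&2
    return 1
  fi
  echo "preflight ok"
}
```

For example, with a hypothetical configuration file named hcdiag.hcl, you could run preflight ./hcdiag.hcl . and continue with hcdiag -config ./hcdiag.hcl -dryrun only if it succeeds.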
»Examine the results
hcdiag generates an archive file with the troubleshooting data about the cluster in the current working directory.
Extract the archive.
Note: The extracted directory uses a timestamp as part of its name, so any references to it in this tutorial will differ from what you see on your local machine.
Navigate to the directory of the same name.
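Because the timestamp differs on every run, a glob pattern is a convenient way to find the archive. The sketch below assumes that the archive is a gzipped tarball whose name starts with hcdiag-. Check the actual file name in your working directory and adjust the pattern if it differs. The function name is illustrative.

```shell
# Extract every hcdiag archive found in the current directory.
extract_hcdiag_archives() {
  for archive in hcdiag-*.tar.gz; do
    # Skip the literal pattern when no archive matches.
    [ -e "$archive" ] || continue
    tar -xzf "$archive"
    echo "extracted ${archive%.tar.gz}"
  done
}
```

After calling extract_hcdiag_archives, change into the directory name that it prints.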
The directory contains the Manifest.json file, which includes information about the hcdiag run, including start and end time, duration, number of errors encountered, and the configuration options used.
The directory also contains the Results.json file, which includes detailed information about the cluster, the nodes and their configurations, and other details about the environment. The example below has been snipped from the original output.
Finally, the directory contains a sub-directory named nomad-debug-{TIMESTAMP}, which includes additional information about the cluster, clients, servers, and job-related components.
»Additional Notes
hcdiag can also be run against an existing cluster by setting the appropriate environment variables on the machine running the tool. To do so, set the NOMAD_ADDR environment variable to the address of a server in the cluster and NOMAD_TOKEN to a token's SecretID with proper access if ACLs are enabled. The machine also needs to have the nomad binary available in the environment path.
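A sketch of that setup follows. The address is a hypothetical placeholder that uses 4646, Nomad's default HTTP API port. Replace it with the address of one of your servers.

```shell
# Hypothetical server address; replace with a server in your cluster.
export NOMAD_ADDR="http://127.0.0.1:4646"
# With ACLs enabled, also export a token SecretID, for example:
#   export NOMAD_TOKEN="<SecretID>"

# hcdiag also needs the nomad binary on the PATH.
if command -v nomad >/dev/null 2>&1; then
  echo "nomad binary found"
else
  echo "nomad binary not found on PATH"
fi
```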
»About ACLs
To complete a full diagnostic successfully with ACLs enabled, hcdiag should be run with the management token. This is because one of the endpoints it queries is /v1/operator/raft/configuration, which explicitly requires the management token. Without that token, hcdiag prints a warning message in the output that references a 403 Forbidden error and skips the Raft configuration endpoint.
Despite this warning, hcdiag can still be used as long as the token set in NOMAD_TOKEN has read permissions on the /agent, /nodes, /operator, and /plugins endpoints. The results will only be missing the diagnostic information from the Raft configuration endpoint.
The following policy can be used to grant the necessary permissions to the token.
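A policy along these lines could look like the sketch below, which grants read access to the agent, node, operator, and plugin rules that correspond to the endpoints listed above. The file name, policy name, and token name are hypothetical.

```shell
# Hypothetical file name for the policy.
cat > hcdiag-policy.hcl <<'EOF'
agent {
  policy = "read"
}

node {
  policy = "read"
}

operator {
  policy = "read"
}

plugin {
  policy = "read"
}
EOF

# With a management token exported, the policy could then be applied and
# attached to a new token (names here are illustrative):
#   nomad acl policy apply hcdiag-read hcdiag-policy.hcl
#   nomad acl token create -name="hcdiag" -policy="hcdiag-read"
```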
»Cleanup
Exit the Ubuntu container to return to your terminal prompt.
Stop the Docker container. It is deleted automatically because of the --rm flag passed to the docker run command at the beginning of the tutorial.
»Next Steps
In this tutorial, you learned about the hcdiag tool, how to use it to gather information about your Nomad cluster, and tips for using it in production.
For additional information about the tool, check out the hcdiag GitHub repository.
There are also hcdiag guides for other HashiCorp tools including Vault, Terraform, and Consul.