Operations

Troubleshooting Vault

Troubleshooting is a fundamental task for Vault operators. However, catching an error with Vault can be a complex exercise; Vault connects to so many other systems that it can be difficult to ascertain what's gone on, but doing so in a timely and efficient manner is of the utmost importance.

Vault may often fire a cluster of errors, and getting to the root of the issue may take some time. There are a few general steps, however, that you can take to gather as much information as possible about the error that's being created, what's responsible for it (Vault, a third party service, the UI, the API, etc.), and then fix it. This article will run through a few general approaches for making sense of Vault errors by reproducing the error, looking at the logs, checking the error's source, and looking at our external resources.

Vault Logs

Vault has two types of logs - Vault server logs and audit logs. The audit logs record every request made to Vault as well as the response sent from Vault. The server logs are operational logs that provide insights into what the server is doing internally and in the background as Vault runs.

Logging is extremely useful when you are troubleshooting because it provides context for the error. You can see the Vault server configuration, as well as the actions Vault tried to take in the moments that precede the error, which provides an insight into fixing it.

Audit Logs

Audit devices are the components in Vault that responsible for managing audit logs. Every request to Vault and response from Vault goes through the configured audit devices. This provides a simple way to integrate Vault with multiple audit logging destinations of different types.

The generated audit log contains every authenticated interaction with Vault including errors. There is an audit log entry for each request and its response, a compressed JSON object that looks like this:

...
{
  "time": "2019-11-05T00:40:27.638711Z",
  "type": "request",
  "auth": {
    "client_token": "hmac-sha256:6291b17ab99eb5bf3fd44a41d3a0bf0213976f26c72d12676b33408459a89885",
    "accessor": "hmac-sha256:2630a7b8e996b0c451db4924f32cec8793d0eb69609f777d89a5c8188a742f52",
    "display_name": "root",
    "policies": [
      "root"
    ],
    "token_policies": [
      "root"
    ],
    "token_type": "service"
  },
  "request": {
    "id": "9adb5544-637f-3d42-9459-3684f5d21996",
    "operation": "update",
    "client_token": "hmac-sha256:6291b17ab99eb5bf3fd44a41d3a0bf0213976f26c72d12676b33408459a89885",
    "client_token_accessor": "hmac-sha256:2630a7b8e996b0c451db4924f32cec8793d0eb69609f777d89a5c8188a742f52",
    "namespace": {
      "id": "root"
    },
    "path": "sys/policies/acl/admin",
    "data": {
      "policy": "hmac-sha256:212744709e5a643a5ff4125160c26983f8dab537f60d166c2fac5b95547abc33"
    },
    "remote_address": "127.0.0.1"
  }
}
{
  "time": "2019-11-05T00:40:27.641524Z",
  "type": "response",
  "auth": {
    "client_token": "hmac-sha256:6291b17ab99eb5bf3fd44a41d3a0bf0213976f26c72d12676b33408459a89885",
    "accessor": "hmac-sha256:2630a7b8e996b0c451db4924f32cec8793d0eb69609f777d89a5c8188a742f52",
    "display_name": "root",
    "policies": [
      "root"
    ],
    "token_policies": [
      "root"
    ],
    "token_type": "service"
  },
  "request": {
    "id": "9adb5544-637f-3d42-9459-3684f5d21996",
    "operation": "update",
    "client_token": "hmac-sha256:6291b17ab99eb5bf3fd44a41d3a0bf0213976f26c72d12676b33408459a89885",
    "client_token_accessor": "hmac-sha256:2630a7b8e996b0c451db4924f32cec8793d0eb69609f777d89a5c8188a742f52",
    "namespace": {
      "id": "root"
    },
    "path": "sys/policies/acl/admin",
    "data": {
      "policy": "hmac-sha256:212744709e5a643a5ff4125160c26983f8dab537f60d166c2fac5b95547abc33"
    },
    "remote_address": "127.0.0.1"
  },
  "response": {}
}
...

NOTE: The log output is formatted using jq for readability.

Notice that the sensitive information such as your client token is obfuscated by default (HMAC-SHA256). This is because Vault prioritizes safety over availability.

Enabling audit devices

When a Vault server is first started, no auditing is enabled. Audit devices must be enabled by a privileged user whose policy must inclue the following rules:

# 'sudo' capability is required to manage audit devices
path "sys/audit/*"
{
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}

# To list enabled audit devices, 'sudo' capability is required
path "sys/audit"
{
  capabilities = ["read", "sudo"]
}

To enable an audit device, execute the vault audit enable command.

Example:

The following command enables the audit device, file at the file/ path. The output logs are stored in the /vault/vault-audit.log file.

$ vault audit enable file file_path=/vault/vault-audit.log
Success! Enabled the file audit device at: file/

As a best practice, enable multiple audit devices for your production servers; this way, you have some audit trace even if one of the audit devices becomes unavailable.

You can also use vault audit list -detailed to get the full path for audit device options.

$ vault audit list -detailed
Path      Type    Description    Replication    Options
----      ----    -----------    -----------    -------
file/     file    n/a            replicated     file_path=/vault/vault-audit.log

If an error occurred with your request or response, the error message is included in the error field's value.

...
    "response": {
    "data": {
      "error": "hmac-sha256:eec8e9b7b8d9e4720693c077225f01aeae8ef50bb2e196522041400cc76b8e3c"
    }
  },
  "error": "1 error occurred:\n\t* permission denied\n\n"
...

To quickly find a list of all non-empty and non-null error fields from the log, use the following command (where $AUDIT_LOG_FILE is the actual filename of the Vault audit device log you are analyzing):

$ jq 'select(.error != null) | select(.error != "") | [.time,.error] | @sh' $AUDIT_LOG_FILE

If this command returns nothing, then there are no errors in the log.

Vault Server Logs

When the Vault server is starting up, it logs the configuration information such as listener ports, logging level, storage backend type, and Vault version that you are running.

==> Vault server configuration:

                     Cgo: disabled
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: debug
                   Mlock: supported: false, enabled: false
                 Storage: file
                 Version: Vault v1.2.3+ent

==> Vault server started! Log data will stream in below:

2019-11-04T17:48:44.226-0800 [WARN]  no `api_addr` value specified in config or in VAULT_API_ADDR; falling back to detection if possible, but this value should be manually set
2019-11-04T17:48:44.230-0800 [DEBUG] storage.cache: creating LRU cache: size=0
2019-11-04T17:48:44.231-0800 [DEBUG] cluster listener addresses synthesized: cluster_addresses=[0.0.0.0:8201]
2019-11-04T17:48:58.965-0800 [DEBUG] core: unseal key supplied
2019-11-04T17:48:58.967-0800 [DEBUG] core: starting cluster listeners
2019-11-04T17:48:58.967-0800 [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
2019-11-04T17:48:58.967-0800 [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2019-11-04T17:48:58.968-0800 [INFO]  core: post-unseal setup starting
...

Once the server is started, the rest of the log entries include the time, the log level (e.g., INFO), the log source, and the log message. Even if you can't fix the error, these logs will be invaluable in troubleshooting.

In the server logs, you'll find errors in the log level as ERR , but you may find further context in WARN as well as in the other preceding and surrounding log entries.

Server Log Level

To specify the Vault server's log level, you can do one of the following:

  • Use the -log-level CLI command flag
  • Set in the VAULT_LOG_LEVEL environment variable
  • Specify with log_level parameter in the server configuration file

Supported values (in order of detail) are trace, debug, info, warn, and err. The default log level is info.

  1. Using the CLI command

    When starting the Vault server via CLI, pass the -log-level flag to specify the log level.

    $ vault server -config=/etc/vault/config-file.hcl -log-level=debug
    
  2. VAULT_LOG_LEVEL environment variable

    Set the log level in an environment variable.

    $ export VAULT_LOG_LEVEL=debug
    
  3. Server configuration file

    Specify the log_level parameter in the server configuration file.

    storage "consul" {
      address = "127.0.0.1:8500"
      path    = "vault"
    }
    
    listener "tcp" {
      address     = "127.0.0.1:8200"
      tls_disable = 1
    }
    
    telemetry {
      statsite_address = "127.0.0.1:8125"
      disable_hostname = true
    }
    
    log_level = "Debug"
    

Changing the log level

When you change the log level by editing the server configuration file or the VAULT_LOG_LEVEL environment variable value, the change won't take an effect until the Vault server is restarted. When you have an HA cluster, apply the change on the standby nodes first, and then lastly on the active node. By doing this, you are ensuring that if the active node fails and one of the standby nodes becomes the new active node, it has the desired level of server logs.

HA Cluster

Finding server logs on Linux systems

On modern systemd based Linux distributions, the journald daemon will capture Vault's log output automatically to the system journal. Assuming your Vault service is named vault, use a command like this to retrieve only the Vault-specific log entries from the system journal:

$ journalctl -b --no-pager -u vault
...
Oct 15 17:01:47 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:47.950Z [DEBUG] replication.index.local: saved checkpoint: num_dirty=0
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=auth/token/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=secret/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=sys/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=identity/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=cubbyhole/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.947Z [DEBUG] replication.index.perf: saved checkpoint: num_dirty=0
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.950Z [DEBUG] replication.index.local: saved checkpoint: num_dirty=0

If your Vault systemd service is not named vault or you're unsure of the service name, then you can use a more generic command:

$ journalctl -b | awk '$5 ~ "vault"'

The output should go back to the system boot time and will sometimes also include restarts of Vault. If the output from the above includes log lines prefixed with vault[NNNN]: , then you've found the operational logs.

To package these logs for sharing, you can execute a command such as:

$ journalctl -b --no-pager -u vault | gzip -9 > /tmp/"$(hostname)-$(date +%Y-%m-%dT%H-%M-%SZ)-vault.log.gz"

This will generate a compressed log file in the /tmp directory:

/tmp/ip-10-42-0-27-2018-10-15T17:06:49Z-vault.log.gz

Not finding the server logs?

If you don't find these vault[NNNN]lines in your output, your Vault startup script could be instead sending the log output elsewhere. To find it, take a look into the Vault systemd unit, which is often (but not always) located at /etc/systemd/system/vault.service . If you notice something similar to the following:

...
[Service]
...
ExecStart=/bin/sh -c '/home/vagrant/bin/vault server -config=/home/vagrant/vault_nano/config/vault -log-level="trace" > /var/log/vault.log
...

Then Vault is likely storing its operational logging in the static file located at /var/log/vault.log.

If Vault is not operating on on Linux or is not operating on a systemd based Linux, it could be configured to log to the system log via a facility like logger, and so Vault's logs could be part of the main system logs in these locations:

Docker

Logs from Vault Docker containers can be retrieved with the docker logs command:

$ docker logs vault0

Where vault0 is the container name.

To grab all Vault logs from a container and compress them, use a command line like:

$ docker logs vault0 2>&1 | gzip -9 - > vault0.log.gz

Kubernetes

Logs from Vault Kubernetes pods can be retrieved with the kubectl logs command:

$ kubectl logs vault-55bcb779b4-8mfn6

Where vault-55bcb779b4-8mfn6 is the pod name.

Troubleshoot the storage backend

Vault offers a number of configurable storage options (e.g. Consul, MySQL, etc.) and root cause of Vault failure may be the storage backend.

When Vault encountered an outage, you may need to troubleshoot the storage backend as well.

Troubleshooting approach

Reproduce the bug

Review the Vault configuration and environment as shown in the Vault server logs. If possible, try to reproduce the error in a clean environment and a new vault storage state. Try reproducing the bug as cleanly as possible; some errors in Vault can be temporary.

Source of the error

Determine if the error is coming from the Vault UI or the API, or if it's from Vault or a third-party service. If the issue is observed in the UI, check the network inspector to understand the API call and response. This should help you ascertain it if is an API or a UI error. For example, if an AWS backend is being used, is the error coming from the AWS API?

If it's from Vault, check if the parameters in your request are mentioned in the error at all, then check documentation for those parameters. Remember that the audit logs can provide the insight into every request came into Vault.

During the troubleshooting, you may need the raw audit data with no hashing. To collect the raw data, you can enable an audit device with log_raw=true parameter.

$ vault audit enable -path=file_raw file \
        file_path=/vault/audit-law.log \
        log_raw=true

Reproduce the error to generate the audit log with raw data.

After collecting the information you need, be sure to disable the raw audit:

$ vault audit disable file_raw

Vault policies

When you receive the 403 permission denied error, it is necessary to review the policies. The permission denied errors can often be the result of a path / policy mis-match.

You can use the vault token capabilities command to check allowed operations against a path.

Example:

# Create a token with a policy of question
$ vault token create -policy=webapp

Key                  Value
---                  -----
token                s.IcTMGNOug5Cx3wBqpGvI5X4e
token_accessor       s2FhMCQssibpiGeBzVWhxJmn
token_duration       768h
token_renewable      true
token_policies       ["default" "webapp"]
identity_policies    []
policies             ["default" "webapp"]

# Using the token with policy attached, check the capabilities against a path
$ vault token capabilities s.IcTMGNOug5Cx3wBqpGvI5X4e transit/decrypt/phone-number
deny

This example shows that the client token has no permission (deny) against the transit/decrypt/phone-number path which explains why Vault returned the permission denied error when the application tried to invoke the endpoint.

Search the Vault GitHub and Google Group

Often times, the issue you encountered may be a known issue and perhaps, it's been fixed or a workaround is provided. Search the Vault GitHub repository as well as in our Google Group.

Also, search the Vault Changelog to see if the issue was fixed in the newer version.

If you are comfortable reading the source code, you can search for a particular error string in the Vault repository.

Github repo

Narrowing down to the particular Vault version branch to match the version that you are running may speed up your search.

Vault debug tool

The vault debug command can be executed on a Vault server node for a specific period of time, recording information about the node, its cluster and its host environment. The information collected is packaged and written to the user specified path.

To create a debug package using default duration (2 minutes) and interval (30 seconds) in the current directory capturing all applicable targets, execute the command with no parameter.

$ vault debug

The output name scheme is vault-debug-<time-stamp> which gets written to the current directory. To specify the output location and the file name different from the default, use the -output flag.

To create a debug package with 1 minute interval for 10 minutes, execute the following command:

$ vault debug -interval=1m -duration=10m

The generated debug package contents may look similar to:

# First, untar the file
$ tar xvfz vault-debug-2019-11-06T01-26-54Z.tar.gz

# List the extracted files and folders
$ tree
.
├── vault-debug-2019-11-06T01-26-54Z
│   ├── 2019-11-06T01-26-54Z
│   │   ├── goroutine.prof
│   │   ├── heap.prof
│   │   ├── profile.prof
│   │   └── trace.out
│   ├── 2019-11-06T01-26-59Z
│   │   ├── goroutine.prof
│   │   ├── heap.prof
│   │   ├── profile.prof
│   │   └── trace.out
...
│   ├── config.json
│   ├── host_info.json
│   ├── index.json
│   ├── metrics.json
│   ├── replication_status.json
│   └── server_status.json

Vault Metrics

The debug package contains the Vault metrics data (metric.json).

...
{
  "Labels": {},
  "Name": "vault.runtime.alloc_bytes",
  "Value": 7805160
},
{
  "Labels": {},
  "Name": "vault.runtime.free_count",
  "Value": 386623
},
{
  "Labels": {},
  "Name": "vault.runtime.heap_objects",
  "Value": 46342
},
...

To learn more about these metrics, refer to the Vault Telemetry documentation for the unit of measurement and definition.

Help and Reference