

Monitor Telemetry & Audit Device Log Data with Splunk

»Challenge

It is important to gain operational and usage insight into a running Vault cluster to understand performance, assist with proactive incident response, and understand business workloads and use cases.

  • Operators and security practitioners need to be aware of conditions that can indicate performance problems for production users or security issues that require immediate attention.
  • Business users concerned with charges or billing need to know specific usage details, such as counts of resources like dynamic secrets and their leases.

»Solution

Vault provides rich operational telemetry metrics that can be consumed by popular solutions for monitoring and alerting on key operational conditions, along with audit devices that log each Vault request and response.

Using the Vault telemetry and audit device features together with metrics and log aggregation agents and an analysis and monitoring solution can provide the necessary insight into Vault operations and usage.

Here, you will learn about important metrics to monitor and action steps for responding to anomalies in specific metrics.

You can also find guidance in the form of a practical example configuration that you can try in your own environment or in an online tutorial that builds out this environment for you.


»Monitoring approaches

There are 3 common approaches that you can use to monitor the health of an application like Vault.

  1. Time-series telemetry data involves capturing metrics from the application, storing them in a special database or index, and analyzing trends in the data over time. Examples: Splunk, Grafana, CloudWatch, DataDog, Circonus
  2. Log analytics relates to capturing log streams from the system and the application, extracting useful signals from the data, and then further analyzing the results. Examples: Splunk, Elasticsearch, SumoLogic
  3. Active health checks use active methods of connecting to the application and interacting with it to ensure it is responding properly. Examples: Consul, Nagios, Sensu, Keynote

All of these methods have their place in a comprehensive monitoring solution, but the focus here is on the capture and analysis of time-series telemetry metrics along with audit device log request and response data.

»Available time-series monitoring solutions

Vault and Consul use the go-metrics package internally to export telemetry, and currently share some of its supported agent solutions as sinks.

Once metrics reach an agent, they typically need to then be forwarded to a storage solution for analysis.

Some of the more popular tools for this portion of the monitoring stack are detailed in the following sections.

»Graphite and Grafana

Graphite is an open-source tool for storing and graphing time-series data. It does not support dashboards or alerts, but Grafana can be used in conjunction with Graphite to provide those features.

»Telegraf, InfluxDB, Chronograf & Kapacitor

Telegraf, InfluxDB, Chronograf, and Kapacitor make up a monitoring solution commonly known as the TICK stack. Together, these four tools provide a full solution for storing, displaying, and alerting on time-series data.

This solution is available in both open-source and commercial versions from InfluxData.

»Amazon CloudWatch

CloudWatch is Amazon's solution for monitoring AWS cloud resources. It handles both time-series data and log files. If you are running Vault and Consul in AWS, it can be an easy choice to make.

One limitation of CloudWatch is that time-series data is only available at a 1-minute granularity and only for 15 days. After that, the data is rolled up into 5-minute and one-hour buckets. For more details, see the CloudWatch FAQs.

»Prometheus

Prometheus is a modern alternative to statsd-compatible daemons and is increasingly popular in the containerized world.

Rather than the UDP-based push mechanism used by statsd, Prometheus relies on lightweight HTTP servers called "exporters" which collect metrics that are then scraped by a Prometheus server.

»DataDog

DataDog is a commercial software-as-a-service solution. It provides a customized statsd agent, DogStatsD, which includes several vendor-specific extensions such as tagging and service check results.

If you use DataDog, you would use their DogStatsd instead of a tool like Telegraf.

»Splunk and Telegraf

There are numerous commercial and open-source choices, but configuring those solutions is beyond the scope of what you will learn here.

Instead, you will learn from a practical example monitoring solution based on Splunk, Fluentd, and Telegraf. Complete configuration steps and example dashboards to get you started are provided.

Vault Enterprise users can go even further with access to a Splunk app that features a rich variety of predefined dashboards.

Before diving into the practical example, you should take time to carefully review the following sections, which present important operational and usage metrics from both Vault and Consul.

»Understanding metrics and audit device data

The next 3 sections detail the information that you can get from Vault operational metrics, usage metrics, and audit device log data.

These sections are a good reference if you already use a monitoring stack and want to identify the critical data to monitor, and they can familiarize you with these metrics if they are new to you.

»Vault operational metrics

The following are critical Vault operational metrics from Vault telemetry and from the Telegraf agent itself related to overall server health and system-level performance.

These metrics are the most useful for ops teams to monitor and alert on in production deployments.

»Seal status Consul health check

This metric is formatted as:

consul_health_checks[check_name="Vault Sealed Status"].passing

For this metric, a value of 1 indicates Vault is unsealed, whereas 0 means that Vault is sealed.

Why it is important:

By default, Vault is sealed on startup, so if this value changes to 0 during the day, Vault has restarted for some reason. And until it's unsealed, it won't answer requests from clients.

What to look for:

A value of 0 being reported by any host.
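As a quick manual cross-check of this health check, you can query a node directly with the Vault CLI. This is a sketch that assumes the VAULT_ADDR environment variable points at the node you want to inspect.

# Show seal status for the addressed node; the "Sealed" field is true or false.
$ vault status

# vault status exits non-zero when the node is sealed or unreachable,
# so it also works in simple scripted checks.
$ vault status > /dev/null 2>&1 || echo "Vault is sealed or unreachable"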

»CPU metrics

These metrics represent system level CPU measurements that are provided by the Telegraf agent.

»cpu.usage_user

Metric source: Telegraf
Description: This metric represents the percentage of CPU being used by user processes, such as Vault or Consul.

»cpu.usage_iowait

Metric source: Telegraf
Description: This metric represents the percentage of CPU time spent waiting for I/O tasks to complete.

Why it is important:

Encryption can place a heavy demand on the CPU. If the CPU is too busy, Vault may have trouble keeping up with the incoming request load. You may also want to monitor each CPU individually to make sure requests are evenly balanced across all CPUs.

What to look for:

If cpu.usage_iowait is greater than 10%.

»Network metrics

These metrics represent system level network measurements that are provided by the Telegraf agent.

»net.bytes_recv

Metric source: Telegraf
Description: This metric represents the bytes received on each network interface.

»net.bytes_sent

Metric source: Telegraf
Description: This metric represents the bytes transmitted on each network interface.

Why it is important:

A sudden spike in network traffic to Vault might be the result of an anomalous client causing too many requests, or additional load you did not plan for.

What to look for:

Sudden large changes to the net metrics (greater than 50% deviation from baseline).

»Memory usage

These metrics represent both system level memory measurements that are provided by the Telegraf agent and Vault specific memory measurements that are provided as part of the Vault runtime.

»mem.total

Metric source: Telegraf
Description: This metric represents the total amount of physical memory (RAM) available on the server.

»mem.used_percent

Metric source: Telegraf
Description: This metric represents the percentage of physical memory in use.

»vault.runtime.alloc_bytes

Metric source: Vault
Description: This metric represents the number of bytes allocated by the vault process.
Unit: bytes
Type: summary
  • vault.runtime.alloc_bytes.value provides the value.

»vault.runtime.sys_bytes

Metric source: Vault
Description: This metric represents the total number of bytes of memory obtained from the OS by the vault process.
Unit: bytes
Type: summary
  • vault.runtime.sys_bytes.value provides the value.

»vault.runtime.num_goroutines

Metric source: Vault
Description: This metric represents the number of goroutines associated with the vault process. This metric can serve as a general system load indicator and is worth establishing a baseline and thresholds for alerting.
Unit: goroutines
Type: summary
  • vault.runtime.num_goroutines.value provides the value.

Why it is important:

Blocked goroutines can increase memory usage and slow garbage collection.

»swap.used_percent

Metric source: Telegraf
Description: This metric represents the percentage of swap space in use.

Why it is important:

Vault requires sufficient memory to hold its working data set and if it exhausts available memory it can crash. You should also monitor total available memory to make sure some memory is available for other processes, and swap usage should remain at 0% for best performance.

What to look for:

If vault.runtime.sys_bytes exceeds 90% of total physical memory (mem.total), if mem.used_percent is over 90%, or if swap.used_percent is greater than 0.

»Garbage collection metrics

These metrics represent garbage collection related measurements that are provided by the Vault runtime.

»vault.runtime.gc_pause_ns

Metric source: Vault
Description: This metric represents the number of nanoseconds consumed by garbage collection (GC) pauses since Vault started.
Unit: nanoseconds
Type: sample
  • vault.runtime.gc_pause_ns.count provides a count of GC pauses.
  • vault.runtime.gc_pause_ns.lower provides the lower bound for time taken by GC pauses.
  • vault.runtime.gc_pause_ns.mean provides the mean for time taken by GC pauses.
  • vault.runtime.gc_pause_ns.stddev provides the standard deviation for time taken by GC pauses.
  • vault.runtime.gc_pause_ns.sum provides the sum of time taken by GC pauses.
  • vault.runtime.gc_pause_ns.upper provides the upper bound for time taken by GC pauses.

Why it is important:

GC pause is a stop-the-world event; all runtime threads are blocked until GC completes. Normally these pauses last only a few nanoseconds. But if memory usage is high, the Go runtime may GC so frequently that it starts to slow down Vault.

What to look for:

Warning if total_gc_pause_ns exceeds 2 seconds/minute, critical if it exceeds 5 seconds/minute

»Disk metrics

These metrics represent system level disk measurements that are provided by the Telegraf agent.

»diskio.read_bytes

Metric source: Telegraf
Description: This metric represents bytes read from each block device.

»diskio.write_bytes

Metric source: Telegraf
Description: This metric represents bytes written to each block device.

»disk.used_percent

Metric source: Telegraf
Description: This metric represents per-mount-point block device utilization.

Why it is important:

When using integrated storage, Vault disk I/O performance becomes a more critical factor and proactive monitoring and alerting on disk performance for Vault servers is crucial.

When using storage backends other than integrated storage, Vault generally doesn't require too much disk I/O, so a sudden change in disk activity could mean that debug or trace logging has accidentally been enabled in production, which can impact performance.

Too much disk I/O can cause the rest of the system to slow down or become unavailable as the kernel spends all its time waiting for I/O to complete.

What to look for:

Sudden large changes to the diskio metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline). Over 80% utilization on block device mount points on which Vault data are persisted.

»Audit device related metrics

These are critical Vault metrics, and can often provide a first alert that an audit device log is blocked.

»vault.audit.file/.log_request

Metric source: Vault
Description: This metric represents a count of requests to an enabled file audit device.
Unit: ms
Type: summary
  • vault.audit.file/.log_request.count provides a count of audit device requests.
  • vault.audit.file/.log_request.lower provides the lower bound for time taken by audit device requests.
  • vault.audit.file/.log_request.mean provides the mean for time taken by audit device requests.
  • vault.audit.file/.log_request.stddev provides the standard deviation for time taken by audit device requests.
  • vault.audit.file/.log_request.sum provides the sum of time taken by audit device requests.
  • vault.audit.file/.log_request.upper provides the upper bound for time taken by audit device requests.

»vault.audit.file/.log_response

Metric source: Vault
Description: This metric represents a count of responses to log requests specifically to an enabled file audit device.
Unit: ms
Type: summary
  • vault.audit.file/.log_response.count provides a count of audit device responses.
  • vault.audit.file/.log_response.lower provides the lower bound for time taken by audit device responses.
  • vault.audit.file/.log_response.mean provides the mean for time taken by audit device responses.
  • vault.audit.file/.log_response.stddev provides the standard deviation for time taken by audit device responses.
  • vault.audit.file/.log_response.sum provides the sum of time taken by audit device responses.
  • vault.audit.file/.log_response.upper provides the upper bound for time taken by audit device responses.

»vault.audit.log_request

Metric source: Vault
Description: This metric represents a count of requests to enabled audit devices.
Unit: ms
Type: summary
  • vault.audit.log_request.count provides a count of audit device requests.
  • vault.audit.log_request.lower provides the lower bound for time taken by audit device requests.
  • vault.audit.log_request.mean provides the mean for time taken by audit device requests.
  • vault.audit.log_request.stddev provides the standard deviation for time taken by audit device requests.
  • vault.audit.log_request.sum provides the sum of time taken by audit device requests.
  • vault.audit.log_request.upper provides the upper bound for time taken by audit device requests.

»vault.audit.log_response

Metric source: Vault
Description: This metric represents a count of responses to log requests to an enabled audit device.
Unit: ms
Type: summary
  • vault.audit.log_response.count provides a count of audit device responses.
  • vault.audit.log_response.lower provides the lower bound for time taken by audit device responses.
  • vault.audit.log_response.mean provides the mean for time taken by audit device responses.
  • vault.audit.log_response.stddev provides the standard deviation for time taken by audit device responses.
  • vault.audit.log_response.sum provides the sum of time taken by audit device responses.
  • vault.audit.log_response.upper provides the upper bound for time taken by audit device responses.

»vault.audit.log_request_failure

Metric source: Vault
Description: This metric represents a count of failed attempts to log requests to an enabled audit device.
Unit: failures
Type: counter
  • vault.audit.log_request_failure.value provides the number of audit device log request failures since startup.

»vault.audit.log_response_failure

Metric source: Vault
Description: This metric represents a count of failed attempts to log responses to an enabled audit device.
Unit: failures
Type: counter
  • vault.audit.log_response_failure.value provides the number of audit device log response failures since startup.

Why it is important:

These metrics are of utmost importance as a blocked audit device can cause Vault to deliberately stop servicing requests. Review the Blocked Audit Devices documentation for more information.

»Request handling metrics

These metrics represent counts and measurements that are provided by Vault core request handlers.

»vault.core.handle_request

Metric source: Vault
Description: This metric represents the duration of requests handled by Vault core.
Unit: ms
Type: summary
  • vault.core.handle_request.count provides a count of requests handled by core.
  • vault.core.handle_request.lower provides the lower bound for time taken by requests handled by core.
  • vault.core.handle_request.mean provides the mean for time taken by requests handled by core.
  • vault.core.handle_request.stddev provides the standard deviation for time taken by requests handled by core.
  • vault.core.handle_request.sum provides the sum of time taken by requests handled by core.
  • vault.core.handle_request.upper provides the upper bound for time taken by requests handled by core.

Why it is important:

This is a key measure of Vault's response time and request volume.

What to look for:

Changes to the count or mean fields that exceed 50% of baseline values, or more than 3 standard deviations above baseline.

»vault.core.handle_login_request

Metric source: Vault
Description: This metric represents the duration of login requests handled by Vault core.
Unit: ms
Type: summary
  • vault.core.handle_login_request.count provides a count of login requests handled by core.
  • vault.core.handle_login_request.lower provides the lower bound for time taken by login requests handled by core.
  • vault.core.handle_login_request.mean provides the mean for time taken by login requests handled by core.
  • vault.core.handle_login_request.stddev provides the standard deviation for time taken by login requests handled by core.
  • vault.core.handle_login_request.sum provides the sum of time taken by login requests handled by core.
  • vault.core.handle_login_request.upper provides the upper bound for time taken by login requests handled by core.

Why it is important:

This is a key measure of Vault's login response time and login request volume.

What to look for:

Changes to the count or mean fields that exceed 50% of baseline values, or more than 3 standard deviations above baseline.

»Route-specific metrics

Vault also provides metrics about operations against specific routes, including those in use by enabled secrets engines.

»vault.route.<operation>.<mount>

The general format of the route based metrics is as follows:

vault.route.<operation>.<mount point>

These Vault metrics represent the time to handle an operation by a particular mount point. Instead of labels, there is one metric per operation/mount pair.

These metrics give a good approximation of response time per API endpoint. The metric is measured slightly below the HTTP layer, auditing, and some common request processing in the Vault stack, but above any per-API handling.

The following are some specific examples taken from live metrics.

For example, a graph of vault.route.rollback.sys-.mean displays the mean time for rollback operations against the system backend paths.

Another example is vault.route.read.auth-token-, which represents the authentication token read route; here are the available values:

  • vault.route.read.auth-token-.count provides a count of authentication token reads.
  • vault.route.read.auth-token-.lower provides the lower bound for time taken by authentication token reads.
  • vault.route.read.auth-token-.mean provides the mean for time taken by authentication token reads.
  • vault.route.read.auth-token-.stddev provides the standard deviation for time taken by authentication token reads.
  • vault.route.read.auth-token-.sum provides the sum of time taken by authentication token reads.
  • vault.route.read.auth-token-.upper provides the upper bound for time taken by authentication token reads.

»Leadership metrics

These are critical operational metrics related to Vault cluster leadership changes, and can help you spot an unhealthy cluster or leadership flapping condition.

»vault.core.leadership_setup_failed

Metric source: Vault
Description: This metric represents the duration of time taken by cluster leadership setup failures which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status.
Unit: ms
Type: summary
  • vault.core.leadership_setup_failed.lower provides the lower bound for time taken by cluster leadership setup failures.
  • vault.core.leadership_setup_failed.mean provides the mean for time taken by cluster leadership setup failures.
  • vault.core.leadership_setup_failed.stddev provides the standard deviation for time taken by cluster leadership setup failures.
  • vault.core.leadership_setup_failed.sum provides the sum of time taken by cluster leadership setup failures.
  • vault.core.leadership_setup_failed.upper provides the upper bound for time taken by cluster leadership setup failures.

»vault.core.leadership_lost

Metric source: Vault
Description: This metric represents the duration of time taken by cluster leadership losses which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status.
Unit: ms
Type: summary
  • vault.core.leadership_lost.lower provides the lower bound for time taken by cluster leadership losses.
  • vault.core.leadership_lost.mean provides the mean for time taken by cluster leadership losses.
  • vault.core.leadership_lost.stddev provides the standard deviation for time taken by cluster leadership losses.
  • vault.core.leadership_lost.sum provides the sum of time taken by cluster leadership losses.
  • vault.core.leadership_lost.upper provides the upper bound for time taken by cluster leadership losses.

Why it is important:

The measured value of this metric answers the question "how long was this server the leader, when it lost leadership?"

What to look for:

Any count greater than zero means that Vault experienced a leadership change and could potentially be cause for alerting. Note that a high mean value here is better than a low value, because a low value suggests leadership flapping.

»vault.core.post_unseal

Metric source: Vault
Description: This metric represents the duration of time taken by post-unseal operations handled by Vault core.
Unit: ms
Type: gauge
  • vault.core.post_unseal.lower provides the lower bound for time taken by post-unseal setup.
  • vault.core.post_unseal.mean provides the mean for time taken by post-unseal setup.
  • vault.core.post_unseal.stddev provides the standard deviation for time taken by post-unseal setup.
  • vault.core.post_unseal.sum provides the sum of time taken by post-unseal setup.
  • vault.core.post_unseal.upper provides the upper bound for time taken by post-unseal setup.

Why it is important:

This metric is useful for supporting or debugging problems with Vault startup after unsealing.

»Replication metrics

If you use Vault Enterprise Replication, then these metrics are important to monitor on every primary and secondary cluster that participates in replication.

»vault.replication.wal.last_wal

Metric source: Vault
Description: This metric represents the index of the last WAL.
Unit: sequence number
Type: gauge
  • vault.replication.wal.last_wal.value provides the last WAL index.

»vault.replication.wal.last_dr_wal

Metric source: Vault
Description: This metric represents the index of the last Disaster Recovery (DR) replication WAL.
Unit: sequence number
Type: gauge
  • vault.replication.wal.last_dr_wal.value provides the last DR mode replication WAL index.

»vault.replication.wal.last_performance_wal

Metric source: Vault
Description: This metric represents the index of the last Performance Replication WAL.
Unit: sequence number
Type: gauge
  • vault.replication.wal.last_performance_wal.value provides the last Performance mode replication WAL index.

»vault.replication.fsm.last_remote_wal

Metric source: Vault
Description: This metric represents the index of the last remote WAL.
Unit: sequence number
Type: gauge
  • vault.replication.fsm.last_remote_wal.value provides the last remote WAL index.

»Replication RPC metrics

These metrics represent replication RPC measurements that are provided by Vault.

»replication.rpc.client.stream_wals

Metric source: Vault
Description: This metric represents the duration of time taken by the client to stream WALs.
Unit: ms
Type: summary
  • replication.rpc.client.stream_wals.lower provides the lower bound for time taken by a client to stream WALs.
  • replication.rpc.client.stream_wals.mean provides the mean for time taken by a client to stream WALs.
  • replication.rpc.client.stream_wals.stddev provides the standard deviation for time taken by a client to stream WALs.
  • replication.rpc.client.stream_wals.sum provides the sum of time taken by a client to stream WALs.
  • replication.rpc.client.stream_wals.upper provides the upper bound for time taken by a client to stream WALs.

»vault.replication.rpc.client.fetch_keys

Metric source: Vault
Description: This metric represents the duration of time taken by a client to perform a fetch keys request.
Unit: ms
Type: summary
  • vault.replication.rpc.client.fetch_keys.lower provides the lower bound for time taken by a client to perform a fetch keys request.
  • vault.replication.rpc.client.fetch_keys.mean provides the mean for time taken by a client to perform a fetch keys request.
  • vault.replication.rpc.client.fetch_keys.stddev provides the standard deviation for time taken by a client to perform a fetch keys request.
  • vault.replication.rpc.client.fetch_keys.sum provides the sum of time taken by a client to perform a fetch keys request.
  • vault.replication.rpc.client.fetch_keys.upper provides the upper bound for time taken by a client to perform a fetch keys request.

»vault.replication.rpc.client.conflicting_pages

Metric source: Vault
Description: This metric represents the duration of time taken by a client conflicting page request.
Unit: ms
Type: summary
  • vault.replication.rpc.client.conflicting_pages.lower provides the lower bound for time taken by a client conflicting page request.
  • vault.replication.rpc.client.conflicting_pages.mean provides the mean for time taken by a client conflicting page request.
  • vault.replication.rpc.client.conflicting_pages.stddev provides the standard deviation for time taken by a client conflicting page request.
  • vault.replication.rpc.client.conflicting_pages.sum provides the sum of time taken by a client conflicting page request.
  • vault.replication.rpc.client.conflicting_pages.upper provides the upper bound for time taken by a client conflicting page request.

»vault.replication.merkleSync

Metric source: Vault
Description: This metric represents the duration of time to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication.
Unit: ms
Type: summary
  • vault.replication.merkleSync.lower provides the lower bound for time to perform a Merkle Tree based synchronization.
  • vault.replication.merkleSync.mean provides the mean for time to perform a Merkle Tree based synchronization.
  • vault.replication.merkleSync.stddev provides the standard deviation for time to perform a Merkle Tree based synchronization.
  • vault.replication.merkleSync.sum provides the sum of time to perform a Merkle Tree based synchronization.
  • vault.replication.merkleSync.upper provides the upper bound for time to perform a Merkle Tree based synchronization.

»vault.replication.merkleDiff

Metric source: Vault
Description: This metric represents the duration of time to perform a Merkle Tree based delta generation between the clusters participating in replication.
Unit: ms
Type: summary
  • vault.replication.merkleDiff.lower provides the lower bound for time to perform a Merkle Tree based delta generation.
  • vault.replication.merkleDiff.mean provides the mean for time to perform a Merkle Tree based delta generation.
  • vault.replication.merkleDiff.stddev provides the standard deviation for time to perform a Merkle Tree based delta generation.
  • vault.replication.merkleDiff.sum provides the sum of time to perform a Merkle Tree based delta generation.
  • vault.replication.merkleDiff.upper provides the upper bound for time to perform a Merkle Tree based delta generation.

»Write-ahead log metrics

These metrics relate to Vault Write Ahead Log (WAL) operations.

»vault.wal_gc_total

Metric source: Vault
Description: This metric represents the total number of Write Ahead Logs (WAL) on disk.
Unit: WAL
Type: counter
  • vault.wal_gc_total.value provides the total number.

»vault.wal.persistWALs

Metric source: Vault
Description: This metric represents the amount of time required to persist the Vault write-ahead logs (WAL) to the storage backend.
Unit: ms
Type: summary
  • vault.wal.persistWALs.lower provides the lower bound for time required to persist WALs to the storage backend.
  • vault.wal.persistWALs.mean provides the mean for time required to persist WALs to the storage backend.
  • vault.wal.persistWALs.stddev provides the standard deviation for time required to persist WALs to the storage backend.
  • vault.wal.persistWALs.sum provides the sum of time required to persist WALs to the storage backend.
  • vault.wal.persistWALs.upper provides the upper bound for time required to persist WALs to the storage backend.

»vault.wal.flushReady

Metric source: Vault
Description: This metric represents the amount of time required to flush the Vault write-ahead logs (WAL) to the persist queue.
Unit: ms
Type: summary
  • vault.wal.flushReady.lower provides the lower bound for time required to flush the Vault WALs to the persist queue.
  • vault.wal.flushReady.mean provides the mean for time required to flush the Vault WALs to the persist queue.
  • vault.wal.flushReady.stddev provides the standard deviation for time required to flush the Vault WALs to the persist queue.
  • vault.wal.flushReady.sum provides the sum of time required to flush the Vault WALs to the persist queue.
  • vault.wal.flushReady.upper provides the upper bound for time required to flush the Vault WALs to the persist queue.

Why it is important:

The Vault write-ahead logs (WALs) are used to replicate Vault data between clusters. WALs are written and stored even if Enterprise Replication is not currently enabled. The WAL is purged every few seconds by a garbage collector. But if Vault is under heavy load, the WALs may start to accumulate, putting pressure on the storage.

What to look for:

  • flushReady is over 500ms
  • persistWALs is over 1000ms

»Identity metrics

These metrics represent identity entity measurements that are provided by Vault.

»vault.identity.num_entities

Metric source: Vault
Description: This metric was introduced in version 1.4.1 and represents the number of identity entities.
Unit: entities
Type: gauge
  • vault.identity.num_entities.value provides the total number of identity entities.

»Expiration metrics

These metrics represent lease measurements that are provided by Vault.

»vault.expire.num_leases

Metric source: Vault
Description: This metric represents the number of all leases which are eligible for eventual expiry.
Unit: leases
Type: gauge
  • vault.expire.num_leases.value provides the total number of leases which are eligible for eventual expiry.

Why it is important:

This value represents an approximate total lease count for Vault across all lease generating auth methods and secrets engines.

What to look for:

A large and unexpected delta in the count can indicate that a bulk operation, load testing, or a runaway client application is generating excessive leases, and should be investigated immediately.

»vault.expire.revoke

Metric source: Vault
Description: This metric represents the duration of time to revoke a token.
Unit: ms
Type: summary
  • vault.expire.revoke.lower provides the lower bound for time to revoke a token.
  • vault.expire.revoke.mean provides the mean for time to revoke a token.
  • vault.expire.revoke.stddev provides the standard deviation for time to revoke a token.
  • vault.expire.revoke.sum provides the sum of time to revoke a token.
  • vault.expire.revoke.upper provides the upper bound for time to revoke a token.

»Integrated storage metrics

These metrics relate to the integrated storage (Raft) backend. If you use this storage backend, you should monitor these metrics.

»vault.raft-storage.delete

Metric source: Vault
Description: This metric represents the time to insert a log entry into the delete path.
Unit: ms
Type: summary
  • vault.raft-storage.delete.lower provides the lower bound for time to insert a log entry into the delete path.
  • vault.raft-storage.delete.mean provides the mean for time to insert a log entry into the delete path.
  • vault.raft-storage.delete.stddev provides the standard deviation for time to insert a log entry into the delete path.
  • vault.raft-storage.delete.sum provides the sum of time to insert a log entry into the delete path.
  • vault.raft-storage.delete.upper provides the upper bound for time to insert a log entry into the delete path.

»vault.raft-storage.get

Metric source: Vault
Description: This metric represents the time to retrieve a value for a path.
Unit: ms
Type: summary
  • vault.raft-storage.get.lower provides the lower bound for time to retrieve value for path from the finite state manager.
  • vault.raft-storage.get.mean provides the mean for time to retrieve value for path from the finite state manager.
  • vault.raft-storage.get.stddev provides the standard deviation for time to retrieve value for path from the finite state manager.
  • vault.raft-storage.get.sum provides the sum of time to retrieve value for path from the finite state manager.
  • vault.raft-storage.get.upper provides the upper bound for time to retrieve value for path from the finite state manager.

»vault.raft-storage.put

Metric source: Vault
Description: This metric represents the time to insert a log entry into the persist path.
Unit: ms
Type: summary
  • vault.raft-storage.put.lower provides the lower bound for time to insert a log entry into the persist path.
  • vault.raft-storage.put.mean provides the mean for time to insert a log entry into the persist path.
  • vault.raft-storage.put.stddev provides the standard deviation for time to insert a log entry into the persist path.
  • vault.raft-storage.put.sum provides the sum of time to insert a log entry into the persist path.
  • vault.raft-storage.put.upper provides the upper bound for time to insert a log entry into the persist path.

»vault.raft-storage.list

Metric source: Vault
Description: This metric represents the time to list all entries.
Unit: ms
Type: summary
  • vault.raft-storage.list.lower provides the lower bound for time to list all entries under the prefix from the finite state manager.
  • vault.raft-storage.list.mean provides the mean for time to list all entries under the prefix from the finite state manager.
  • vault.raft-storage.list.stddev provides the standard deviation for time to list all entries under the prefix from the finite state manager.
  • vault.raft-storage.list.sum provides the sum of time to list all entries under the prefix from the finite state manager.
  • vault.raft-storage.list.upper provides the upper bound for time to list all entries under the prefix from the finite state manager.

»Consul storage metrics

These metrics relate to Consul when it is used as the Vault storage backend. If you use this storage backend, you should monitor these metrics.

The metrics below are available from Vault telemetry. For a full list of Consul metrics, refer to the Monitoring Consul Datacenter Health guide.

»vault.consul.get

Metric source: Vault
Description: This metric represents GET operations against the Consul storage backend.
Unit: ms
Type: summary
  • vault.consul.get.count provides the number of GET operations against the Consul storage backend.
  • vault.consul.get.lower provides the lower bound for duration of GET operations against the Consul storage backend.
  • vault.consul.get.mean provides the mean for duration of GET operations against the Consul storage backend.
  • vault.consul.get.stddev provides the standard deviation for duration of GET operations against the Consul storage backend.
  • vault.consul.get.sum provides the sum of duration of GET operations against the Consul storage backend.
  • vault.consul.get.upper provides the upper bound for duration of GET operations against the Consul storage backend.

»vault.consul.put

Metric source: Vault
Description: This metric represents PUT operations against the Consul storage backend.
Unit: ms
Type: summary
  • vault.consul.put.count provides the number of PUT operations against the Consul storage backend.
  • vault.consul.put.lower provides the lower bound for duration of PUT operations against the Consul storage backend.
  • vault.consul.put.mean provides the mean for duration of PUT operations against the Consul storage backend.
  • vault.consul.put.stddev provides the standard deviation for duration of PUT operations against the Consul storage backend.
  • vault.consul.put.sum provides the sum of duration of PUT operations against the Consul storage backend.
  • vault.consul.put.upper provides the upper bound for duration of PUT operations against the Consul storage backend.

»vault.consul.list

Metric source: Vault
Description: This metric represents LIST operations against the Consul storage backend.
Unit: ms
Type: summary
  • vault.consul.list.count provides the number of LIST operations against the Consul storage backend.
  • vault.consul.list.lower provides the lower bound for duration of LIST operations against the Consul storage backend.
  • vault.consul.list.mean provides the mean for duration of LIST operations against the Consul storage backend.
  • vault.consul.list.stddev provides the standard deviation for duration of LIST operations against the Consul storage backend.
  • vault.consul.list.sum provides the sum of duration of LIST operations against the Consul storage backend.
  • vault.consul.list.upper provides the upper bound for duration of LIST operations against the Consul storage backend.

»vault.consul.delete

Metric source: Vault
Description: This metric represents DELETE operations against the Consul storage backend.
Unit: ms
Type: summary
  • vault.consul.delete.count provides the number of DELETE operations against the Consul storage backend.
  • vault.consul.delete.lower provides the lower bound for duration of DELETE operations against the Consul storage backend.
  • vault.consul.delete.mean provides the mean for duration of DELETE operations against the Consul storage backend.
  • vault.consul.delete.stddev provides the standard deviation for duration of DELETE operations against the Consul storage backend.
  • vault.consul.delete.sum provides the sum of duration of DELETE operations against the Consul storage backend.
  • vault.consul.delete.upper provides the upper bound for duration of DELETE operations against the Consul storage backend.

Why it is important:

These metrics indicate how long it takes for Consul to handle requests from Vault.

What to look for:

Large deltas in the count, upper, or 90_percentile fields.

»Vault usage metrics

The following are fine-grained usage metrics from Vault telemetry introduced in version 1.5. They are related to common types of usage including identity, lease, secret, and token usage.

These metrics are the most useful for business users to measure Vault usage for metering, billing, and similar use cases.

»vault.token.creation

Metric source: Vault
Description: A new service or batch token was created. (The name was chosen to be distinct from vault.token.create, an existing sample metric.)
  • vault.token.creation.value provides the number.

»vault.token.count

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents the number of service tokens available for use.
  • vault.token.count.value provides the number.

»vault.token.count.by_auth

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents the number of existing tokens broken down by the auth method used to create them.
  • vault.token.count.by_auth.value provides the number.

»vault.token.count.by_policy

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents the number of existing tokens, counted for each assigned policy.
  • vault.token.count.by_policy.value provides the number.

»vault.token.count.by_ttl

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents the number of existing tokens, aggregated by their TTL at creation.
  • vault.token.count.by_ttl.value provides the number.

»vault.secret.kv.count

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents the count of secrets in key-value stores.
  • vault.secret.kv.count.value provides the number.

»vault.secret.lease.creation

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents a count of leases created by a secrets engine (excluding leases created internally for token expiration).
  • vault.secret.lease.creation.value provides the number.

»vault.identity.entity.count

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents the number of identity entities.
  • vault.identity.entity.count.value provides the number.

»vault.identity.entity.creation

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents a count of identity entity creation, either from manual creation or automatically upon login with an auth method.
  • vault.identity.entity.creation.value provides the number.

»vault.identity.entity.alias.count

Metric source: Vault
Description: This metric was introduced in version 1.5.0 and represents the number of identity aliases to entities.
  • vault.identity.entity.alias.count.value provides the number.

»File descriptor metrics

These metrics represent system level file descriptor measurements that are provided by the Telegraf agent.

»linux_sysctl_fs.file-nr

Metric source: Telegraf
Description: This metric represents the number of file handles being used across all processes on the host.

»linux_sysctl_fs.file-max

Metric source: Telegraf
Description: This metric represents the total number of available file handles.

Why it is important:

The majority of Vault operations which interact with systems outside of Vault, for example receiving a connection from another host, sending data between hosts, or writing to disk in the case of integrated storage, require a file descriptor handle.

If either Vault or Consul (when Vault uses the Consul storage backend) runs out of handles, it will stop accepting connections.

What to look for:

When file-nr exceeds 80% of file-max, you should alert operators to take proactive measures for reducing load and at least temporarily increasing user limits.
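On a Linux host, you can spot-check the same underlying data that these Telegraf metrics are built from; a minimal sketch:

# Allocated file handles, free handles, and the system-wide maximum.
$ cat /proc/sys/fs/file-nr

# The system-wide maximum on its own.
$ cat /proc/sys/fs/file-max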

»Vault audit device metrics

The following are details about the audit device log data and how you can effectively search them in Splunk.

Get to know the key fields accessible from audit device log data:

  • type: The type of an audit entry, either "request" or "response". For a successful request, there will always be two events. The audit log event of type "response" will include the "request" structure, and will have all the same data as a request entry. For successful requests you can do all your searching on the events with type "response".

  • request.path: The path of the API request.

  • (request|response).mount_type: The type of the mount that handles this request or response.

  • request.operation: The operation performed (e.g. read, create, delete).

  • auth: The authentication information for the caller.

    • .entity_id: If authenticated using an auth backend, the entity-id of the user/service
    • .role_name: Depending on auth backend, the role name of the user/service
  • error: Populated (non-empty) if the response was an error; this field contains the error message.

  • response.data: In the case of a successful response, many responses will contain a data field corresponding to what was returned to the caller. Most fields will be masked with HMAC when sensitive.
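For example, here is a hypothetical search, run with the splunk CLI, that counts successful requests per API path using only events of type "response" as described above. It assumes the vault-audit index created later in this guide, that the splunk binary is on your PATH, that you authenticate when prompted, and that Splunk extracts the JSON fields at search time.

# Count request volume per API path across the audit device data.
$ splunk search 'index=vault-audit type=response | stats count by request.path'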

»Metrics Summary

You have now been introduced to the most critical Vault operational and usage metrics along with information about monitoring and responding to specific examples.

If you want to review a practical configuration example or try the example in an online tutorial, continue on with the practical example.

»Practical example

Vault with Fluentd, Telegraf, and Splunk diagram

You can use the information here to build an example monitoring stack based on Telegraf, Fluentd, and Splunk. It demonstrates a complete solution to help you get started and to inform your own monitoring approach.

Splunk is a popular choice for searching, monitoring, and analyzing application generated data. Fluentd is typically installed on the Vault servers, and helps with sending Vault audit device log data to Splunk. Telegraf agents installed on the Vault servers help send Vault telemetry metrics and system level metrics such as those for CPU, memory, and disk I/O to Splunk. This results in a comprehensive solution to provide insights into a running Vault cluster.

Additionally, a Vault Enterprise Splunk application is available that bundles popular metrics dashboards for operators, security practitioners, and users concerned with metering usage.

While this practical example is provided as a convenience, the information is also a helpful resource for learning about monitoring and alerting on the important data Vault provides to users and operators.

»Notes and prerequisites

To follow along with the practical example, you must install and configure the following software in a Linux or macOS environment.

  • Vault - Either the open source version or Enterprise version can be used. Note that the Enterprise trial version will operate for 30 minutes before sealing itself. Install Vault has more details.
  • Splunk - This example uses the Splunk Enterprise trial version, but the Splunk Cloud or free version will also work. Be aware that the free version is limited to 500MB of daily data ingestion.
  • Fluentd is used to capture and forward events from an enabled audit device log.
  • Telegraf is used to capture and forward Vault telemetry metrics and system level metrics; packages are provided for common Linux distributions and Homebrew for macOS.

Generally useful configuration instructions are shared here based on standard installations of the software for Linux or macOS using a combination of command line tools, configuration file editing, and web user interfaces.

You should already be comfortable operating and configuring Vault to follow along with this example. It presumes that you can install and configure Fluentd, Telegraf, and Splunk and then update an existing Vault configuration to add telemetry functionality.
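For reference, the kind of change this example expects is a telemetry stanza in the Vault server configuration that points at a local statsd-compatible listener. The file path and listener address below are assumptions for illustration; the practical example configures Telegraf to listen on UDP port 8125.

# Append a telemetry stanza to an existing Vault server configuration
# (adjust the configuration file path to match your environment).
$ cat <<'EOF' | sudo tee -a /etc/vault.d/vault.hcl
telemetry {
  dogstatsd_addr   = "localhost:8125"
  disable_hostname = true
}
EOF

# Restart Vault for the change to take effect, then unseal as usual.
$ sudo systemctl restart vault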

The versions of software used in this example are as follows:

  • Vault v1.4.3
  • Fluentd td-agent v1.11
  • Telegraf v1.12.6
  • Splunk Enterprise v8.0.4.1

»Online tutorial

If you'd like to check out using Vault, Fluentd, Telegraf, and Splunk together in a fully pre-configured and hands-on environment, give this online tutorial a try.

It uses the same set of technologies and workflow in a Docker environment.


»Notes about the metrics path

In the data path used for this practical example, Vault exports telemetry metrics, which are rolled up by Telegraf at a configurable interval. The metrics which are then pushed from Telegraf to Splunk take on a slightly different format than the source metric types as defined in go-metrics, which Vault uses for its Telemetry.

Within Splunk, you can expect the following metric type definitions:

  • Counters are represented in Splunk as <metric>.value
  • Gauges are represented in Splunk as <metric>.value
  • Samples are represented in Splunk as <metric>.count, <metric>.mean, <metric>.upper, and so on with each metric measured within the Telegraf collection window

For more technical details, you can consult the Telegraf Service Plugin: statsd documentation.

Where possible, all examples here include the available metric type definitions that can be used.
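For instance, once the vault-metrics index and HEC are configured later in this guide, a hypothetical search like the following (using the splunk CLI) verifies that a gauge such as vault.expire.num_leases is arriving under its .value name:

# Average the vault.expire.num_leases gauge per minute from the metrics index.
$ splunk search '| mstats avg("vault.expire.num_leases.value") WHERE index="vault-metrics" span=1m'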

»Configure Splunk

There are two major areas of Splunk configuration required to collect and search the audit device log data from Fluentd and the telemetry metrics data from Telegraf.

»Audit device data configuration

  • An Events Index is an index type that is optimized for storage and retrieval of event data, such as audit device log data.
  • An HTTP Event Collector (HEC) and its associated access token lets you securely send audit device logs to Splunk over the HTTP and Secure HTTP (HTTPS) protocols. In the example used in this guide, Fluentd will be configured to send these data to Splunk.

»Telemetry metrics configuration

  • A Metrics Index is an index type that is optimized for storage and retrieval of metric data.
  • An HTTP Event Collector and its associated access token lets you securely send metrics to Splunk over the HTTP and Secure HTTP (HTTPS) protocols. In this example, Telegraf will be configured to send these data to Splunk.

You will also need to disable SSL on the HEC and add the Vault indexes to the Splunk admin role.

You can configure Splunk with Splunk Web, the splunk CLI, or HTTP API. The configuration process is currently detailed here only for Splunk Web or the splunk CLI.

Use a browser to open the Splunk Web interface at http://localhost:8000.

Sign in with the username and password of your admin user.

Configure Splunk to receive and index both Vault audit device log and telemetry data into their corresponding HTTP Event Collectors and index types.

»Add events index

Add the events index to contain the audit device log data.

Example events index configuration

  1. From the Splunk Web navigation menu, select Settings.
  2. From under the Data menu, select Indexes.
  3. Click New Index and you will encounter a dialog like the example shown here.
  4. For Index Name, enter vault-audit.
  5. For Index Data Type, select Events.
  6. Leave all other options at their default values.
  7. Click Save.

More information about creating events indexes is available in the Splunk documentation for creating Events indexes.

After creating the events index, you can proceed to creating a metrics index.

»Add metrics index

Add the metrics index to contain the telemetry metrics from Telegraf and Vault.

Example metrics index configuration

  1. From the Splunk Web navigation menu, select Settings.
  2. From under the Data menu, select Indexes.
  3. Click New Index and you will encounter a dialog like the example shown here.
  4. For Index Name, enter vault-metrics.
  5. For Index Data Type, select Metrics.
  6. Leave all other options at their default values.
  7. Click Save.

More information about creating metrics indexes is available in the Splunk documentation for creating metrics indexes.
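If you prefer the splunk CLI over Splunk Web, an equivalent sketch for creating both indexes follows; the binary path and the need to authenticate as an admin user are assumptions about a default Linux install.

# Create the events index for audit device data and the metrics index for telemetry.
$ /opt/splunk/bin/splunk add index vault-audit
$ /opt/splunk/bin/splunk add index vault-metrics -datatype metric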

After creating the indexes, you can proceed to creating the HTTP Event Collectors (HECs) for use by Fluentd and Telegraf for sending audit and metrics data from Vault to Splunk.

»Add HEC for Vault audit device

Add the HEC for Vault audit device logs and save the token for later use with the Fluentd configuration.

Example data inputs configuration

  1. From the Splunk Web navigation menu, select Settings.
  2. From under the Data menu, select Data inputs.
  3. Under the Local inputs section, click Add new beside HTTP Event Collector.

Follow this stepwise process to configure the HEC as shown in the examples.

Example HEC source configuration

First, configure the Select Source settings.

  1. For Name, enter Vault Audit.
  2. For Description, enter Vault file audit device log.
  3. Click Next.

Then, configure the Input Settings.

Example HEC input configuration

  1. From the Input Settings page click New next to Source type.
  2. For Source Type, enter hashicorp_vault_audit_log.
  3. For Source Type Description, enter Vault file audit device log.
  4. Now, scroll down to the Index section.
  5. Click vault-audit to select it as an allowed index.
  6. Click Review.
  7. Click Submit.

You should observe a "Token has been created successfully." message and dialog as shown.

Example HEC token confirmation

Copy the complete value from Token Value and save it for later use when configuring Fluentd.
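Optionally, once SSL is disabled on the HEC in a later step, you can sanity-check the token from the command line. This sketch assumes Splunk is reachable on localhost; substitute your own token value.

# Send a test event to the audit HEC; a successful response is {"text":"Success","code":0}.
$ curl http://localhost:8088/services/collector/event \
    -H "Authorization: Splunk <your-vault-audit-hec-token>" \
    -d '{"event": "hec smoke test", "sourcetype": "hashicorp_vault_audit_log", "index": "vault-audit"}'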

»Add HEC for Vault telemetry metrics

Add the HEC for Vault telemetry metrics and save the resulting token for later use with the Telegraf configuration.

Example data inputs configuration

  1. From the Splunk Web navigation menu, select Settings.
  2. From under the Data menu, select Data inputs.
  3. Under the Local inputs section, click Add new beside HTTP Event Collector.

Follow this stepwise process to configure the HEC as shown in the examples.

Example HEC source configuration

First, configure the Select Source settings.

  1. For Name, enter Vault telemetry.
  2. For Description, enter Vault telemetry metrics.
  3. Click Next.

Then, configure the Input Settings.

Example HEC input configuration

  1. From the Input Settings page click New next to Source type.
  2. For Source Type, enter hashicorp_vault_telemetry.
  3. For Source Type Description, enter Vault telemetry metrics.
  4. From the Input Settings page scroll down to the Index section.
  5. Click vault-metrics to select it as an allowed index.
  6. Click Review.
  7. Click Submit.

You should observe a "Token has been created successfully." message and dialog as shown.

Example HEC token confirmation

Copy the complete value from Token Value and save it for later use when configuring Telegraf.

»Disable SSL on HEC

Splunk enables SSL with a self-signed certificate by default, including for the HEC listeners. To disable SSL on all HEC listeners, access the HEC global settings.

Example HEC token global settings

  1. From the Splunk Web navigation menu, select Settings.
  2. From under the Data menu, select Data inputs.
  3. Click HTTP Event Collector.
  4. Click Global Settings.
  5. Uncheck Enable SSL.
  6. Click Save.

»Add indexes to admin role

Finally, enable the vault-audit and vault-metrics indexes for the admin role so that the searches work as expected.

Example admin role settings screen

  1. From the Splunk Web navigation menu, select Settings.
  2. From under the USERS AND AUTHENTICATION section, select Roles.
  3. Click admin.
  4. Click 3. Indexes.
  5. Scroll to the bottom of the indexes list.
  6. Check the check-boxes for both the Included and the Default columns for both vault-audit and vault-metrics indexes.
  7. Click Save.

Example admin role settings selected screen

This completes the Splunk configuration.

You are now ready to configure Fluentd.

»Configure Fluentd

Fluentd configuration requires that you install td-agent on your Vault servers. You must also install the Fluentd Splunk HEC plugin and configure td-agent by editing its configuration file.

Once you have installed Fluentd and the Fluentd Splunk HEC plugin, use an editor to configure Fluentd.
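For example, with td-agent installed, the Splunk HEC output plugin used by the configuration below (the commonly used fluent-plugin-splunk-hec, which provides the splunk_hec output type) can typically be installed with td-agent's bundled gem command:

# Install the Splunk HEC output plugin into td-agent's embedded Ruby.
$ sudo td-agent-gem install fluent-plugin-splunk-hec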

Edit the td-agent.conf configuration file and add this example input source description for the Vault file audit device log.

<source>
  @type tail
  path /vault/logs/vault-audit.log
  pos_file /vault/logs/vault-audit-log.pos
  <parse>
    @type json
    time_format %iso8601
  </parse>
  tag vault_audit
</source>

<match vault_audit.**>
  @type splunk_hec
  host 10.10.42.100
  port 8088
  token 12b8a76f-3fa8-4d17-b67f-78d794f042fb
</match>

The following values need updates to match your environment.

Update these values under <source>.

  • path is the full path to your Vault audit device log file. Specify an existing file audit device log file here or leave the example as-is if you need to enable an audit device. Instructions for doing so are provided in the Configure Vault section.
  • pos_file is a similarly named file that Fluentd uses for recording file position.

Update these values under <match>.

  • host is the hostname or IP address of your Splunk server.
  • port is the configured Splunk HTTP Event Listener port number.
  • token is the HEC token value for the audit device HEC.

After configuring td-agent, start or restart the service as necessary.

$ systemctl restart td-agent

You can check the td-agent logs for signs of any issues. If you do not yet have a running Vault server, the td-agent logs will likely contain a repetitive error about the missing audit device log file; this is expected and not a problem, as td-agent will keep retrying until it can read the file.
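
For example, on a systemd host with the default td-agent log location, either of these commands lets you watch the agent output:

$ sudo journalctl -u td-agent -f
$ sudo tail -f /var/log/td-agent/td-agent.log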

This completes the Fluentd configuration.

You are now ready to configure Telegraf for statsd compatible input from Vault and HTTP output to the Splunk HEC.

»Configure Telegraf

Telegraf can act as a statsd-compatible agent and can also collect additional metrics of its own. Telegraf provides a range of input plugins to collect data from common sources.

Enable the most common plugins to monitor CPU, memory, disk I/O, networking, and process status in addition to the input and output plugins for Vault and Splunk respectively.

Here is a complete working example Telegraf configuration.

# Global tags relate to and are available for use in Splunk searches
# Of particular note are the index tag, which must match the configured
# metrics index name, and the cluster tag, which should match the value of
# Vault's cluster_name configuration option.

[global_tags]
  index="vault-metrics"
  datacenter = "us-east-1"
  role       = "vault-server"
  cluster    = "vtl"

# Agent options around collection interval, sizes, jitter and so on
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

# An input plugin that listens on UDP/8125 for statsd compatible telemetry
# messages using Datadog extensions which are emitted by Vault
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"
  metric_separator = "."
  datadog_extensions = true

# An output plugin that can transmit metrics over HTTP to Splunk
# You must specify a valid Splunk HEC token as the Authorization value
[[outputs.http]]
  url = "http://10.42.10.100:8088/services/collector"
  data_format="splunkmetric"
  splunkmetric_hec_routing=true
  [outputs.http.headers]
    Content-Type = "application/json"
    Authorization = "Splunk 42c0ff33-c00l-7374-87bd-690ac97efc50"

# Read metrics about cpu usage using default configuration values
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

# Read metrics about memory usage
[[inputs.mem]]
  # No configuration required

# Read metrics about network interface usage
[[inputs.net]]
  # Specify an interface or all
  # interfaces = ["enp0s*"]

# Read metrics about swap memory usage
[[inputs.swap]]
  # No configuration required

# Read metrics about disk usage using default configuration values
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  ## mount_points = ["/"]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]
  # devices = ["sda", "sdb"]
  # skip_serial_number = false

[[inputs.kernel]]
  # No configuration required

[[inputs.linux_sysctl_fs]]
  # No configuration required

[[inputs.netstat]]
  # No configuration required

[[inputs.processes]]
  # No configuration required

[[inputs.procstat]]
  pattern = "(vault)"

[[inputs.system]]
  # No configuration required

The telegraf.conf file starts with global options.

The default collection interval is set to 10 seconds and a host tag is included in each metric.

As previously mentioned, Telegraf also allows you to set additional tags on the metrics that pass through it. In this case, you are adding tags for the cluster, index, datacenter, and role.

These tags can then be used in Splunk to filter queries (for example, to create a dashboard showing only servers with the vault-server role, or only servers in the us-east-1 datacenter).
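
As an illustration, here is a sketch of such a filtered search that reuses the tag names and the vault.token.creation metric from this guide; it restricts results to hosts tagged with the vault-server role in the us-east-1 datacenter.

| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics AND role=vault-server AND datacenter=us-east-1 BY host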

Finally, there are inputs for items like CPU, memory, network I/O, and disk I/O. Most of them don't require any configuration, but make sure the interfaces list in inputs.net matches the interface names you observe in ip addr or ifconfig output on the appropriate servers.

Exceptions to the default values used here are the global tag for the Splunk vault-metrics index, the input for Vault metrics, and the output plugin to send data to Splunk via HEC.

Once you have configured your Telegraf installation, ensure that it is started. Check its log output to confirm there are no issues, then move on to configuring Vault to export telemetry metrics.
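
For example, on a systemd host, restart the service and follow its logs:

$ sudo systemctl restart telegraf
$ sudo journalctl -u telegraf -f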

»Configure Vault

Use the telemetry stanza to configure Vault.

Add a variation of the following example to your Vault configuration file based on the following guidance on each value:

cluster_name = "vtl"
telemetry {
  dogstatsd_addr = "localhost:8125"
  enable_hostname_label = true
  prometheus_retention_time = "0h"
}

The cluster_name option is set at the global configuration scope and specifies a label for the Vault cluster. If your configuration already defines cluster_name, you can keep the existing value.

The options contained in the example telemetry stanza break down as follows.

  • dogstatsd_addr specifies that a statsd protocol-compatible listener with Datadog extensions (provided here by Telegraf) can be reached at host localhost on port UDP/8125
  • enable_hostname_label enables a hostname label on each metric emitted by Vault
  • prometheus_retention_time set to 0 hours effectively disables the Prometheus metrics endpoint

Once you have Vault configured, you need to start or restart it as required.
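
For example, on a systemd host where Vault is managed as a service (keep in mind that restarting Vault seals it unless you use auto-unseal):

$ sudo systemctl restart vault
$ vault status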

After Vault is available for use and you are authenticated to it with a token that has sufficient capabilities to enable an audit device, proceed to enabling the file audit device if you will not be using an existing one.

»Enable file audit device

Unless you can reuse an existing file audit device, the last step in Vault configuration requires that you enable one.

First, ensure that the vault process user has permission to write to the target log output directory, /vault/logs in this example.
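
A minimal sketch, assuming Vault runs as the vault system user and you are using the example directory:

$ sudo mkdir -p /vault/logs
$ sudo chown vault:vault /vault/logs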

Then, use the vault CLI to enable a file audit device that writes audit request and response data to the file /vault/logs/vault-audit.log.

$ vault audit enable file file_path=/vault/logs/vault-audit.log

Successful example output:

Success! Enabled the file audit device at: file/
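
You can confirm that the device is active by listing the enabled audit devices.

$ vault audit list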

Now that Vault is configured, you can begin to explore the audit device and metrics data in Splunk.

A Splunk App for Monitoring Vault, which consists of pre-built dashboards and reports, is available with Vault Enterprise. Without the app, you can still build your own dashboards from scratch, as all the data sources are available with all versions of Vault. The immediate topics below describe how you can explore metrics and search events. We also provide some example search queries.

»Explore metrics

An example of Splunk analytics functionality

You can explore the metric data in Splunk to learn more about what is available.

  1. Click the Splunk Enterprise logo image to reach the Splunk Web home page.
  2. Click Search & Reporting.
  3. Click Analytics.
  4. Click Metrics to drill into the metrics.

From here, you can explore all of the metrics exported by Telegraf, including Vault and system level metrics. By browsing here and finding metrics to chart, you can add them to dashboards for reuse.

»Search events

An example of Splunk search functionality

You can also explore the audit device log events in a similar manner as the metrics.

  1. Click the Splunk Enterprise logo image to reach the Splunk Web home page.
  2. Click Search & Reporting.
  3. In the search field, enter index="vault-audit".
  4. You should observe some results similar to the examples in the screenshot.

»Example search queries

Suppose you want to examine the time-to-live assigned to all tokens created in the past 24 hours. Access the search bar and use the vault.token.creation metric to obtain the total across all clusters in the index, as in this example.

| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics BY creation_ttl

Token TTL graph

You should observe something like the example screenshot as a result; note the count is 85 for tokens having a 2 hour TTL and 201 for tokens having an infinite TTL.

Selecting a different time range returns the total over that range. The data will be shown as an interactive table that can be sorted by any of the columns.

You can further augment this search to limit it to a particular cluster, a particular auth method, or a particular mount point, or further break down the query by any of these labels.

For example, to examine the mount points which are creating large numbers of long-lived tokens use a search query like this example.

| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics AND cluster=<your-cluster> AND creation_ttl=+Inf BY mount_point

Be sure to change the example value <your-cluster> to the actual cluster_name of the Vault cluster whose metrics you wish to search.

Token TTL graph

You should observe something like the example screenshot as a result; note the count is 201 for tokens issued from the auth/token endpoint. If more auth methods were enabled, you could expect them to be listed here as well with their respective counts.

To get a time series instead, add span=30m to the end of the query to get one data point per 30 minutes.
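
For example, the mount point query above becomes the following.

| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics AND cluster=<your-cluster> AND creation_ttl=+Inf BY mount_point span=30m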

»Example Dashboard

Let's build a simple dashboard around a mix of system information and Vault token creation information.

  1. Select Search & Reporting.
  2. Select Analytics.
  3. In the navigation at left click vault to expand the Vault metrics.
  4. Scroll down to the bottom of the list.
  5. Click token.
  6. Click create.
  7. Click count.
  8. In the Analysis section click the drop-down under Aggregation and select Sum.
  9. Click Chart Settings.
  10. Select Column.

You should observe a graph like this example.

Example token creation metric graph

With one chart ready, make another for the token creation mean duration.

  1. In the left column under Metrics and token click create.
  2. Click mean.
  3. Click Chart Settings.
  4. Select Column.

You should observe a graph like this example.

Example token creation metric graph

Next, make another graph for the CPU usage for user processes.

  1. In the left column under Metrics scroll to the top of the list.
  2. Click cpu.usage.
  3. Click user.
  4. Click Chart Settings.
  5. Select Area.

You should observe a graph like this example.

Example CPU usage metric graph

Finally, make another graph for allocated bytes of memory to the Vault process.

  1. In the left column under Metrics scroll down to vault.
  2. Click runtime.
  3. Click alloc_bytes.value.
  4. Click Chart Settings.
  5. Select Area.

You should observe a graph like this example.

Example memory allocation metric graph

With 4 graphs ready, follow these steps to create a dashboard based on them.

Save all to charts step

  1. Click the ellipsis button in the upper right area of the middle pane as shown.
  2. Click Save all charts to a dashboard.
  3. Complete the Save All To Dashboard dialog.

Save dashboard dialog

  1. For Dashboard Title, enter Basic Vault Token Metrics.
  2. For Dashboard Description, enter Some basic Vault token metrics and related system metrics in an example dashboard.
  3. Click Save.
  4. You should observe a Your Dashboard Has Been Created dialog.
  5. Click View Dashboard.

Now, you can edit the dashboard further to arrange it.

Example dashboard edit screen

  1. Click Edit.
  2. Drag the individual graphs to arrange them; in this example they are positioned 2 by 2.
  3. When the graphs are positioned, click Save.

Here is a screenshot of the final example dashboard.

Example dashboard

»Splunk App

Vault Enterprise users can take advantage of a complete Splunk app built with care by the HashiCorp product and engineering teams. It includes powerful dashboards that split metrics into logical groupings targeted at both operators and business users of Vault. The Splunk application is available with Vault Enterprise Platform; however, all the data sources leveraged by the application are available with all versions of Vault.

The following are example dashboards and their metrics from the Vault Enterprise Splunk app along with some example queries that you can use with the App.

NOTE: Refer to this section for step-by-step instructions on configuring Splunk, Fluentd, Telegraf, and Vault to use the Splunk app.

»Vault Operations Metrics from Telemetry Dashboard

The operations dashboard combines self-reported data from Vault with information from the Telegraf agent on each host. You can filter to a particular time range, limit your view to a particular cluster, and select any subset of the hosts to display.

Vault operations metrics from telemetry dashboard

The seal status of each Vault instance is shown, with sealed instances displayed first. Disk I/O, Network I/O, and CPU statistics are listed for each host that has been selected (by default, all on the same graph). Important thresholds or summary statistics are indicated by a dotted line.

The next pair of graphs shows request latency, as reported by Vault. Requests to secret engines and login requests are plotted separately, with 50th and 90th percentile shown.

Failures and losses graph

The dashboard counts the total number of audit log failures during the selected time window; in a properly functioning system this should be zero. The time a leadership loss was reported appears here, as well as times of any leadership setup failure.

The next section reports on memory usage and lease count; these memory statistics and their importance are described above.

Memory graph

A source of a high number of leases can be investigated on the usage dashboard. A high rate of lease revocation may explain performance problems. The final sections of the dashboard report on Vault’s replication and encryption mechanisms.

Replication graph

A larger than normal number of uncollected write-ahead-log entries may indicate a storage bottleneck. These entries are generated even if replication is not currently in use.

The barrier statistics show operations performed at Vault’s encryption barrier; the dashboard reports both the count of operations (left axis) and the average latency (right axis). Typically these numbers closely track the operations at the storage layer, but Vault may perform caching or batching to combine or eliminate storage operations.

»Storage Metrics from Telemetry Dashboard

Integrated storage graph

This dashboard shows metrics about the integrated storage backend, or about the Consul storage backend. The top row shows measurement of operation latencies, in milliseconds, along with 50th and 90th percentiles.

For integrated storage, the next row shows the total count of operations by type, and statistics about storage entry sizes. The maximum size over the entire window is plotted as a dashed line; the blue line represents maximum within each time period, and a gray line represents mean entry size. The last row shows host I/O metrics relevant to integrated storage: disk I/O throughput, disk usage, and network I/O for each host selected in the drop-down.

»Vault Usage Metrics from Telemetry Dashboard

The usage metrics page gives information on tokens, secrets, leases and entities. The top panels let you examine the rate of token creation, and the number of tokens available for use in Vault.

Vault usage metrics from telemetry graph

Some combinations of filters are not available and will cause the corresponding panels to disappear. You can change the data series used for the time series plots by selecting "Namespace", "Auth methods", "Mount points", "Policies", or "Creation TTL".

For example, if you select "Mount points", then select "Auth method : github" the graphs will show information specific to any enabled GitHub auth methods.

The next section shows a count of the number of key-value secrets stored in Vault, and the rate of lease creation by mount point. These can be filtered by namespace, and the lease creation plot can be filtered by lease TTL or by secret engine.

Secrets and leases graph

The identity section gives information on which engines create entities, the total count, and the count by alias.

Identity graph

A final section shows the top 15 most common operations, by type and mount point, within the selected time window.

»Additional Use-Case Dashboards

The “Quota Monitoring” dashboard plots the vault.quota.rate_limit.violation and vault.quota.lease_count.violation metrics that were introduced in Vault 1.5 as part of the resource quotas feature. The percentage utilization is plotted for each lease count quota; note that this metric is only emitted when a lease is allocated, not when it expires or is revoked.

Resource Quotas

“Where are high TTL tokens created?” can be used to identify the sources of long-TTL tokens. It displays the auth mount points that have created such tokens (via the vault.token.creation metric) and queries the audit log for additional information such as client IP address and username.

Token counts

»Customizing App Dashboards

If you use the Splunk App and would like to customize these dashboards for your own environment, we recommend that you clone them rather than editing in-place.

You can do this from the Dashboards view: select Clone in the Actions drop-down menu to create a copy of the existing dashboard.

This ensures that your customizations are not overwritten when a new version of the Splunk App is installed, while you still receive the latest changes to the original dashboards.

»Example queries for App users

Note that these example queries will function only if you are using the Vault Enterprise Splunk app. If you encounter an error like the following example

Error in 'SearchParser': The search specifies a macro 'vault_audit_log' that cannot be found. Reasons include: the macro name is misspelled, you do not have "read" permission for the macro, or the macro has not been shared with this application. Click Settings, Advanced search, Search Macros to view macro information.

it means that you do not have the Vault Enterprise Splunk App installed and configured. Ensure that you install the App before proceeding with these examples.

The queries in the Splunk App (described below) use a macro to encapsulate the standard parts of a query, such as which index to search and what sourcetype to match. If you have the Splunk App installed, you can use the vault_telemetry macro to limit queries to metrics from Vault and Telegraf that appear in the vault-metrics index.

Another example use case is understanding the number of tokens with the policy "admin" in use in the system. As this is a gauge metric, you cannot simply sum all the values as in the case above. Instead, you need to take the most recent value of the metric, as in this example.

| mstats latest(vault.token.count.by_policy) AS count WHERE `vault_telemetry` AND policy=admin BY cluster,namespace earliest=-30m

One subtlety is that the "latest" value for a particular combination of labels is not necessarily from the same point in time. If a particular policy has stopped being used in namespace "ns1", for example, then a "latest" query matching that namespace may return a value from any point during the time window.

Vault multi-dimensional usage metrics do not report all possible zero values, because doing so would create undue load on metrics collection. The easiest way to handle this time skew is to limit how far back in time the query may go.

Another pitfall to avoid is that in Splunk, metrics are not automatically summed across all possible label sets. So if you query for the latest token count gauge matching a policy, that gauge will represent only one cluster and one namespace.

Use the BY or WHERE clauses with the entire set of available labels, then sum the resulting values explicitly. For example, you can modify the query above to return all policy counts across namespaces, like this.

| mstats latest(vault.token.count.by_policy) AS count WHERE `vault_telemetry` BY cluster,namespace,policy earliest=-30m | stats sum(count) AS count BY policy
| sort -count

The resulting table counts the number of active tokens by policy, summed over all the clusters and namespaces where a token with a matching policy name appears. As before, you can turn this query into a time series by replacing earliest=-30m with span=30m to see the total within each half-hour window.
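
One possible way to render that time series while keeping the per-policy breakdown is sketched below; depending on how you chart the results, you may want to adjust the BY clause.

| mstats latest(vault.token.count.by_policy) AS count WHERE `vault_telemetry` BY cluster,namespace,policy span=30m
| stats sum(count) AS count BY _time, policy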

Here are a couple of examples that use only the audit device log data. The following shows all login events, displaying the token display name, entity ID, and the policies that were attached to the token.

`vault_audit_log` response.auth.accessor=*
| spath output=policies path="response.auth.policies{}"
| table response.auth.display_name, response.auth.entity_id, policies

»Summary

You learned about two sources of Vault operational and usage data in the form of telemetry metrics and audit device logs. You also got to know some of the critical usage and operational metrics along with how to use them in a specific monitoring and graphing stack.

You also learned about a solution consisting of Fluentd, Telegraf, and Splunk for analyzing and monitoring Vault, and were given the opportunity to try it in a comprehensive online tutorial.

Finally, you learned about the Vault Enterprise Splunk app and some of the comprehensive graphs that it includes.

»Help and Reference