Gaining operational and usage insight into a running Vault cluster helps you understand performance, respond proactively to incidents, and understand business workloads and use cases.
Operators and security practitioners need to be aware of conditions that can indicate potential performance implications to production users or security issues which require immediate attention.
Business users concerned with charges or billing must be aware of specific usage regarding the counts of resources like dynamic secrets or their leases.
Vault provides rich operational telemetry metrics that can be consumed by popular solutions for monitoring and alerting on key operational conditions and audit devices for logging each Vault request and response.
Using the Vault telemetry and audit device features in combination with metrics and log aggregation agents, in concert with an analysis and monitoring solution, can provide the necessary insight into Vault operations and usage.
Here, you will learn about important metrics to monitor and action steps for responding to anomalies in specific metrics.
NOTE: The guidance here builds from previous guidance in Vault Cluster Monitoring, which uses Telegraf, InfluxDB, and Grafana. This guidance will eventually supersede the previous document.
You can also find guidance in the form of a practical example configuration that you can try in your own environment or in an online tutorial that builds out this environment for you.
There are 3 common approaches that you can use to monitor the health of an application like Vault.
Time-series telemetry data involves capturing metrics from the application, storing them in a special database or index, and analyzing trends in the data over time. Examples: Splunk, Grafana, CloudWatch, DataDog, Circonus
Log analytics relates to capturing log streams from the system and the application, extracting useful signals from the data, and then further analyzing the results. Examples: Splunk, Elasticsearch, SumoLogic
Active health checks connect to the application and interact with it to ensure it is responding properly. Examples: Consul, Nagios, Sensu, Keynote
All of these methods have their place in a comprehensive monitoring solution, but the focus here is on the capture and analysis of time-series telemetry metrics along with audit device log request and response data.
Graphite is an open-source tool for storing and graphing time-series data. It does not support dashboards or alerts, but Grafana can be used in conjunction with Graphite to provide those features.
Telegraf, InfluxDB, Chronograf, and Kapacitor make up a monitoring solution that is commonly known as the TICK stack. Together, these 4 tools provide a full solution for storing, displaying, and alerting on time-series data.
This solution is available in both open-source and commercial versions from InfluxData.
CloudWatch is Amazon's solution for monitoring AWS cloud resources. It handles both time-series data and log files. If you are running Vault and Consul in AWS, it can be an easy choice to make.
One limitation of CloudWatch is that time-series data is only available at a 1-minute granularity and only for 15 days. After that, the data is rolled
up into 5-minute and one-hour buckets. For more details, see the CloudWatch FAQs.
Prometheus is a modern alternative to statsd-compatible daemons and is increasingly popular in the containerized world.
Rather than the UDP-based push mechanism used by statsd, Prometheus relies on lightweight HTTP servers called "exporters", which collect the metrics that are then scraped by a Prometheus server.
NOTE: Vault provides configurable Prometheus compatible metrics from the /sys/metrics HTTP API endpoint.
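To illustrate what a consumer of Prometheus-compatible output has to handle, here is a minimal sketch that parses text in the Prometheus exposition format into a dictionary. The sample text and its values are made up for illustration and were not captured from a real Vault server. The parser is deliberately simplified: it keeps any label block as part of the series name and does not handle optional timestamps.

```python
def parse_prometheus_text(text):
    """Parse Prometheus exposition text into {series: value}.

    Simplified sketch: comment lines (# HELP / # TYPE) are skipped, the
    series key keeps any {label="..."} part verbatim, and samples with a
    trailing timestamp are not handled.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Hypothetical sample, shaped like Prometheus text output.
SAMPLE = """\
# HELP vault_runtime_num_goroutines vault_runtime_num_goroutines
# TYPE vault_runtime_num_goroutines gauge
vault_runtime_num_goroutines 52
vault_expire_num_leases 7
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["vault_runtime_num_goroutines"])
```

In practice a Prometheus server does this scraping and parsing for you; the sketch only shows the shape of the data.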
DataDog is a commercial software-as-a-service solution. It provides a customized statsd agent, DogStatsd, that includes several vendor-specific extensions such as tagging and service check results.
If you use DataDog, you would use their DogStatsd instead of a tool like Telegraf.
There are numerous commercial and open-source choices, but configuring those solutions is beyond the scope of what you will learn here.
Instead, you will learn from a practical example monitoring solution based on Splunk, Fluentd, and Telegraf. Complete configuration steps and example dashboards to get you started are provided.
Vault Enterprise users can go even further with access to a Splunk app that features a rich variety of predefined dashboards.
Before diving into the practical example, you should take time to carefully review the following sections, which present important operational and usage metrics from both Vault and Consul.
The next 3 sections detail the information that you can get from Vault operational metrics, usage metrics, and audit device log data.
These sections are a useful reference if you already use a monitoring stack and want to identify the critical data to monitor, and they introduce the metrics if they are new to you.
The following are critical Vault operational metrics from Vault telemetry and from the Telegraf agent itself related to overall server health and system-level performance.
These metrics are the most useful for ops teams to monitor and alert on in production deployments.
NOTE: This seal status health check metric is relevant only when using Consul for high availability coordination or storage and in such cases, the metric is emitted by the Consul agent, not Vault itself.
For this metric, a value of 1 indicates Vault is unsealed, whereas 0 means that Vault is sealed.
Why it is important:
By default, Vault is sealed on startup, so if this value changes to 0 during the day, Vault has restarted for some reason.
And until it's unsealed, it won't answer requests from clients.
This metric represents the percentage of CPU time spent waiting for I/O tasks to complete.
Why it is important:
Encryption can place a heavy demand on the CPU. If the CPU is too busy, Vault may have trouble keeping up with the incoming request load. You may also want to monitor each CPU individually to make sure requests are evenly balanced across all CPUs.
This metric represents the bytes transmitted on each network interface.
Why it is important:
A sudden spike in network traffic to Vault might be the result of an anomalous client causing too many requests, or additional load you did not plan for.
What to look for:
Sudden large changes to the net metrics (greater than 50% deviation from baseline).
These metrics represent both system level memory measurements that are provided by the Telegraf agent and Vault specific memory measurements that are provided as part of the Vault runtime.
This metric represents the number of goroutines associated with the vault process. This metric can serve as a general system load indicator and is worth establishing a baseline and thresholds for alerting.
Unit: goroutines
Type: summary
vault.runtime.num_goroutines.value provides the value.
Why it is important:
Blocked goroutines can increase memory usage and slow garbage collection.
This metric represents the percentage of swap space in use.
Why it is important:
Vault requires sufficient memory to hold its working data set and if it exhausts available memory it can crash. You should also monitor total available memory to make sure some memory is available for other processes, and swap usage should remain at 0% for best performance.
What to look for:
Alert if sys_bytes exceeds 90% of total_bytes, if mem.used_percent is over 90%, or if swap.used_percent is greater than 0.
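The three memory conditions above can be sketched as a single check. This is only an illustration: the function name and argument names are hypothetical, and the values are assumed to have already been retrieved from your monitoring system.

```python
def memory_alerts(sys_bytes, total_bytes, mem_used_percent, swap_used_percent):
    """Return a list of alert messages for the memory thresholds.

    Hypothetical helper: inputs are assumed to be the latest values of
    the Vault runtime and Telegraf memory and swap metrics.
    """
    alerts = []
    if total_bytes > 0 and sys_bytes > 0.9 * total_bytes:
        alerts.append("sys_bytes exceeds 90% of total_bytes")
    if mem_used_percent > 90:
        alerts.append("mem.used_percent is over 90%")
    if swap_used_percent > 0:
        alerts.append("swap.used_percent is greater than 0")
    return alerts


# Example: high process memory, healthy system memory, no swap in use.
print(memory_alerts(95, 100, 50.0, 0.0))
```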
This metric represents the number of nanoseconds consumed by garbage collection (GC) pauses since Vault started.
Unit: nanosecond
Type: sample
vault.runtime.gc_pause_ns.count provides a count of GC pauses.
vault.runtime.gc_pause_ns.lower provides the lower bound for time taken by GC pauses.
vault.runtime.gc_pause_ns.mean provides the mean for time taken by GC pauses.
vault.runtime.gc_pause_ns.stddev provides the standard deviation for time taken by GC pauses.
vault.runtime.gc_pause_ns.sum provides the sum of time taken by GC pauses.
vault.runtime.gc_pause_ns.upper provides the upper bound for time taken by GC pauses.
Why it is important:
A GC pause is a stop-the-world event, meaning that all runtime threads are blocked until GC completes. Normally these pauses last only a few nanoseconds. But if memory usage is high, the Go runtime may run GC so frequently that it starts to slow down Vault.
What to look for:
Warning if total_gc_pause_ns exceeds 2 seconds/minute, critical if it exceeds 5 seconds/minute
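As a rough sketch of how the warning and critical thresholds could be evaluated, the following function derives a per-minute pause rate from two samples of the cumulative pause total. It assumes that total_gc_pause_ns is a monotonically increasing counter sampled at a known interval; the function name and its arguments are hypothetical.

```python
NS_PER_SECOND = 1_000_000_000


def gc_pause_severity(prev_total_ns, curr_total_ns, elapsed_seconds):
    """Classify GC pause time between two samples of a cumulative total.

    Returns "critical" above 5 seconds of pause per minute, "warning"
    above 2 seconds per minute, and "ok" otherwise.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    pause_seconds = (curr_total_ns - prev_total_ns) / NS_PER_SECOND
    # Normalize to a per-minute rate so any sampling interval works.
    per_minute = pause_seconds * 60.0 / elapsed_seconds
    if per_minute > 5:
        return "critical"
    if per_minute > 2:
        return "warning"
    return "ok"


# 0.5 s of pauses in a 10 s window is 3 s per minute.
print(gc_pause_severity(0, 500_000_000, 10))
```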
This metric represents per-mount-point block device utilization.
Why it is important:
When using integrated storage, Vault disk I/O performance becomes a more critical factor and proactive monitoring and alerting on disk performance for Vault servers is crucial.
When using storage backends other than integrated storage, Vault generally doesn't require too much disk I/O, so a sudden change in disk activity could mean that debug or trace logging has accidentally been enabled in production, which can impact performance.
Too much disk I/O can cause the rest of the system to slow down or become unavailable as the kernel spends all its time waiting for I/O to complete.
What to look for:
Sudden large changes to the diskio metrics (greater than 50% deviation from baseline, or more than 3 standard deviations
from baseline). Over 80% utilization on block device mount points on which Vault data are persisted.
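The two deviation rules (more than 50% from baseline, or more than 3 standard deviations from baseline) can be sketched as follows. This is an illustrative helper rather than a Splunk feature: the function name is hypothetical, the baseline is assumed to be a list of recent samples of one metric, and the same check could be applied to the net metrics described earlier.

```python
from statistics import mean, stdev


def is_anomalous(history, current, pct_threshold=0.5, sigma_threshold=3.0):
    """Return True if current deviates too far from the baseline samples.

    history: recent samples used as the baseline (at least two values).
    """
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    baseline = mean(history)
    spread = stdev(history)
    delta = abs(current - baseline)
    if baseline == 0:
        # Any change from a zero baseline counts as a percentage breach.
        pct_exceeded = delta > 0
    else:
        pct_exceeded = delta / abs(baseline) > pct_threshold
    sigma_exceeded = spread > 0 and delta > sigma_threshold * spread
    return pct_exceeded or sigma_exceeded


history = [100, 102, 98, 101, 99]
print(is_anomalous(history, 103), is_anomalous(history, 160))
```

A very stable baseline makes the standard deviation rule sensitive to small changes, so you may prefer to require both conditions in noisy environments.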
This metric represents a count of failed attempts to log responses to an enabled audit device.
Unit: failures
Type: counter
vault.audit.log_response_failure.value provides the number of audit device log response failures since startup.
Why it is important:
This metric is of utmost importance, as a blocked audit device can cause Vault to deliberately stop servicing requests. Review the Blocked Audit Devices documentation for more information.
The general format of the route based metrics is as follows:
vault.route.<operation>.<mount>
These Vault metrics represent the time to handle an operation by a particular mount point. Instead of labels, there is one metric per operation/mount pair.
These metrics give you a good approximation of response time per API endpoint. They are measured slightly below HTTP handling, auditing, and some common request processing in the Vault stack, but at a higher level than any per-API handling.
The following are some specific examples taken from live metrics.
vault.route.rollback.sys-.mean
Displays the mean time for rollback operations against the system backend.
Another example is vault.route.read.auth-token-, which represents the authentication token read route; here are the available values:
vault.route.read.auth-token-.count provides a count of authentication token reads.
vault.route.read.auth-token-.lower provides the lower bound for time taken by authentication token reads.
vault.route.read.auth-token-.mean provides the mean for time taken by authentication token reads.
vault.route.read.auth-token-.stddev provides the standard deviation for time taken by authentication token reads.
vault.route.read.auth-token-.sum provides the sum of time taken by authentication token reads.
vault.route.read.auth-token-.upper provides the upper bound for time taken by authentication token reads.
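The examples above suggest that the mount path appears in the metric name with each slash replaced by a hyphen (sys/ becomes sys-, auth/token/ becomes auth-token-). Assuming that convention holds for your mounts, a small hypothetical helper can build the metric names to search for:

```python
def route_metric(operation, mount, suffix="mean"):
    """Build a route metric name for an operation and mount path.

    Assumption inferred from the examples in this section: slashes in
    the mount path are replaced with hyphens in the metric name.
    """
    return f"vault.route.{operation}.{mount.replace('/', '-')}.{suffix}"


print(route_metric("rollback", "sys/"))
print(route_metric("read", "auth/token/", suffix="count"))
```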
These are critical operational metrics related to Vault cluster leadership changes, and can help you spot an unhealthy cluster or leadership flapping condition.
This metric represents the duration of time taken by cluster leadership setup failures which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status.
Unit: ms
Type: summary
vault.core.leadership_setup_failed.lower provides the lower bound for time taken by cluster leadership setup failures.
vault.core.leadership_setup_failed.mean provides the mean for time taken by cluster leadership setup failures.
vault.core.leadership_setup_failed.stddev provides the standard deviation for time taken by cluster leadership setup failures.
vault.core.leadership_setup_failed.sum provides the sum of time taken by cluster leadership setup failures.
vault.core.leadership_setup_failed.upper provides the upper bound for time taken by cluster leadership setup failures.
This metric represents the duration of time taken by cluster leadership losses which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status.
Unit: ms
Type: summary
vault.core.leadership_lost.lower provides the lower bound for time taken by cluster leadership losses.
vault.core.leadership_lost.mean provides the mean for time taken by cluster leadership losses.
vault.core.leadership_lost.stddev provides the standard deviation for time taken by cluster leadership losses.
vault.core.leadership_lost.sum provides the sum of time taken by cluster leadership losses.
vault.core.leadership_lost.upper provides the upper bound for time taken by cluster leadership losses.
Why it is important:
The measured value of this metric answers the question "how long was this server the leader, when it lost leadership?"
What to look for:
Any count greater than zero means that Vault experienced a leadership change and could potentially be cause for alerting. Do note that a high mean value here is considered better than a low value, because a low value means there is leadership flapping.
If you use Vault Enterprise Replication, then these metrics will be of importance in monitoring for every primary and secondary cluster that participates in replication.
This metric represents the duration of time to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication.
Unit: ms
Type: summary
vault.replication.merkleSync.lower provides the lower bound for time to perform a Merkle Tree based synchronization.
vault.replication.merkleSync.mean provides the mean for time to perform a Merkle Tree based synchronization.
vault.replication.merkleSync.stddev provides the standard deviation for time to perform a Merkle Tree based synchronization.
vault.replication.merkleSync.sum provides the sum of time to perform a Merkle Tree based synchronization.
vault.replication.merkleSync.upper provides the upper bound for time to perform a Merkle Tree based synchronization.
This metric represents the amount of time required to flush the Vault write-ahead logs (WAL) to the persist queue.
Unit: ms
Type: summary
vault.wal.flushReady.lower provides the lower bound for time required to flush the Vault WALs to the persist queue.
vault.wal.flushReady.mean provides the mean for time required to flush the Vault WALs to the persist queue.
vault.wal.flushReady.stddev provides the standard deviation for time required to flush the Vault WALs to the persist queue.
vault.wal.flushReady.sum provides the sum of time required to flush the Vault WALs to the persist queue.
vault.wal.flushReady.upper provides the upper bound for time required to flush the Vault WALs to the persist queue.
Why it is important:
The Vault write-ahead logs (WALs) are used to replicate Vault data between clusters. WALs are written and stored even if Enterprise Replication is not currently enabled. The WAL is purged every few seconds by a garbage collector. But if Vault is under heavy load, the WALs may start to accumulate, putting pressure on the storage.
This metric represents the number of all leases which are eligible for eventual expiry.
Unit: leases
Type: gauge
vault.expire.num_leases.value provides the total number of leases which are eligible for eventual expiry.
Why it is important:
This value represents an approximate total lease count for Vault across all lease generating auth methods and secrets engines.
What to look for:
A large and unexpected delta in the count can indicate that a bulk operation, load testing, or a runaway client application is generating excessive leases, and should be investigated immediately.
These metrics relate to the Consul storage backend. If you use this storage backend, you should monitor these metrics.
The metrics below relate to Consul when used as a storage backend. They are available in Vault telemetry. However, for a full list of Consul metrics, refer to the Monitoring Consul Datacenter Health tutorial.
The following are fine-grained usage metrics from Vault telemetry introduced in version 1.5. They are related to common types of usage including identity, lease, secret, and token usage.
These metrics are the most useful for business users to measure Vault usage for metering, billing, and similar use cases.
This metric was introduced in version 1.5.0 and represents a count of leases created by a secrets engine (excluding leases created internally for token expiration).
vault.secret.lease.creation.value provides the number.
This metric was introduced in version 1.5.0 and represents a count of identity entity creation, either from manual creation or automatically upon login with an auth method.
vault.identity.entity.creation.value provides the number.
This metric represents the total number of available file handles, and is provided by the Telegraf agent.
Why it is important:
The majority of Vault operations which interact with systems outside of Vault, for example receiving a connection from another host, sending data between hosts, or writing to disk in the case of integrated storage, require a file descriptor handle.
If either Vault or Consul (in the case of Vault using Consul storage backend) runs out of handles, it will stop accepting connections.
NOTE: By default, process and kernel user limits are fairly conservative. You should increase these beyond the defaults and in line with the values recommended by other HashiCorp resources, such as the example LimitNOFILE value shown for configuring systemd in the Vault Deployment Guide.
What to look for:
When file-nr exceeds 80% of file-max, you should alert operators to take proactive measures for reducing load and at least temporarily increasing user limits.
The following are details about the audit device log data and how you can effectively search them in Splunk.
Get to know the key fields accessible from audit device log data:
type: The type of an audit entry, either "request" or "response". For a successful request, there will always be two events. The audit log event of
type "response" will include the "request" structure, and will have all the same
data as a request entry. For successful requests you can do all your searching
on the events with type "response".
request.path: The path of an API request
(request|response).mount_type: The type of the mount that handles this request or response.
request.operation: The operation performed (e.g., read, create, delete)
auth: The authentication information for the caller
.entity_id: If authenticated using an auth backend, the entity-id of the user/service
.role_name: Depending on auth backend, the role name of the user/service
error: Populated (non-empty) if the response was an error, this field contains the error message.
response.data: In the case of a successful response, many will contain a data
field corresponding to what was returned to the caller. Most fields will be
masked with HMAC when sensitive.
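To make the field descriptions above concrete, here is a minimal sketch that reads audit log lines (one JSON object per line), keeps only entries of type "response", and extracts a few of the key fields. The sample entries are hypothetical and heavily abbreviated; real audit entries contain many more fields, and the function name is not part of any Vault tooling.

```python
import json


def summarize_responses(lines):
    """Extract key fields from audit entries of type "response"."""
    summary = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("type") != "response":
            continue
        request = entry.get("request") or {}
        auth = entry.get("auth") or {}
        summary.append({
            "path": request.get("path"),
            "operation": request.get("operation"),
            "mount_type": request.get("mount_type"),
            "entity_id": auth.get("entity_id"),
            # Empty string when the response was not an error.
            "error": entry.get("error", ""),
        })
    return summary


# Hypothetical, abbreviated audit entries for illustration only.
SAMPLE_LINES = [
    '{"type": "request", "request": {"path": "secret/data/app", "operation": "read"}}',
    '{"type": "response", "auth": {"entity_id": "hypothetical-entity-id"}, '
    '"request": {"path": "secret/data/app", "operation": "read", "mount_type": "kv"}}',
    '{"type": "response", "error": "permission denied", '
    '"request": {"path": "sys/mounts", "operation": "read", "mount_type": "system"}}',
]

rows = summarize_responses(SAMPLE_LINES)
for row in rows:
    print(row["path"], row["operation"], row["error"] or "ok")
```

Filtering on the "response" type mirrors the search advice above, since response entries carry the request structure as well.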
You have now been introduced to the most critical Vault operational and usage metrics along with information about monitoring and responding to specific examples.
If you want to review a practical configuration example or try the example in an online tutorial, continue on with the practical example.
This tutorial focuses on the key Vault telemetry. Refer to Vault Limits
and Maximums to understand
the known upper limits on the size of certain fields and objects, and
configurable limits on others.
You can use the information here to build an example monitoring stack based on Telegraf, Fluentd, and Splunk. It demonstrates a complete example solution to help you get started and to inform your own monitoring solution.
Splunk is a popular choice for searching, monitoring, and analyzing application generated data. Fluentd is typically installed on the Vault servers, and helps with sending Vault audit device log data to Splunk. Telegraf agents installed on the Vault servers help send Vault telemetry metrics and system level metrics such as those for CPU, memory, and disk I/O to Splunk. This results in a comprehensive solution to provide insights into a running Vault cluster.
Additionally, a Vault Enterprise Splunk application is available that bundles popular metrics dashboards for operators, security practitioners, and users concerned with metering usage.
While this practical example is provided as a convenience, the information is also a helpful resource for learning about monitoring and alerting on the important data Vault provides to users and operators.
NOTE: The Splunk app is available for Vault Enterprise Platform users.
To follow along with the practical example, you must install and configure the following software in a Linux or macOS environment.
Vault - You can use either the open source version or the Enterprise version. Note that the Enterprise trial version will operate for 30 minutes before sealing itself. See Install Vault for more details.
Splunk - This example uses the Splunk Enterprise trial version, but the Splunk Cloud or free version will also work. Be aware that the free version is limited to 500MB of daily data ingestion.
Fluentd is used to capture and forward events from an enabled audit device log.
Telegraf is used to capture and forward Vault telemetry metrics and system level metrics; packages are provided for common Linux distributions and Homebrew for macOS.
Generally useful configuration instructions are shared here based on standard installations of the software for Linux or macOS using a combination of command line tools, configuration file editing, and web user interfaces.
You should already be comfortable operating and configuring Vault to follow along with this example. This example presumes that you can install and configure Fluentd, Telegraf, and Splunk and then update an existing Vault configuration to add telemetry functionality.
The versions of software used in this example are as follows:
If you'd like to check out using Vault, Fluentd, Telegraf, and Splunk together in a fully pre-configured and hands-on environment, give this online tutorial a try.
It uses the same set of technologies and workflow in a Docker environment.
In the data path used for this practical example, Vault exports telemetry metrics, which are rolled up by Telegraf at a configurable interval. The metrics that Telegraf then pushes to Splunk take on a slightly different format than the source metric types defined in go-metrics, which Vault uses for its telemetry.
Within Splunk, you can expect the following metric type definitions:
Counters are represented in Splunk as <metric>.value
Gauges are represented in Splunk as <metric>.value
Samples are represented in Splunk as <metric>.count, <metric>.mean, <metric>.upper, and so on with each metric measured within the Telegraf collection window
If you have problems getting values from a metric in searches, make sure that you are not forgetting to add the final .value, .mean, etc. to your desired Splunk metric.
Where possible, all examples here include the available metric type definitions that can be used.
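The naming rules above can be sketched as a small helper that expands a base metric name into the names you would search for in Splunk. The list of sample suffixes is taken from the metric listings earlier in this guide; the function itself is hypothetical and exists only to illustrate the convention.

```python
# Suffixes listed for sample-type metrics earlier in this guide.
SAMPLE_SUFFIXES = ("count", "lower", "mean", "stddev", "sum", "upper")


def splunk_metric_names(metric, metric_type):
    """Return the Splunk metric names for a go-metrics metric.

    metric_type is one of "counter", "gauge", or "sample".
    """
    if metric_type in ("counter", "gauge"):
        return [f"{metric}.value"]
    if metric_type == "sample":
        return [f"{metric}.{suffix}" for suffix in SAMPLE_SUFFIXES]
    raise ValueError(f"unknown metric type: {metric_type}")


print(splunk_metric_names("vault.expire.num_leases", "gauge"))
print(splunk_metric_names("vault.runtime.gc_pause_ns", "sample"))
```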
There are two major areas of Splunk configuration required to collect and search the audit device log data from Fluentd and the telemetry metrics data from Telegraf.
An Events Index is an index type that is optimized for storage and retrieval of event data, such as the audit device log entries.
An HTTP Event Collector (HEC) and its associated access token lets you securely send audit device logs to Splunk over the HTTP and Secure HTTP (HTTPS) protocols. In the example used in this tutorial, Fluentd will be configured to send these data to Splunk.
A Metrics Index is an index type that is optimized for storage and retrieval of metric data.
An HTTP Event Collector and its associated access token lets you securely send metrics to Splunk over the HTTP and Secure HTTP (HTTPS) protocols. In this example, Telegraf will be configured to send these data to Splunk.
You will also need to disable SSL on the HEC and add the Vault indexes to the Splunk admin role.
You can configure Splunk with Splunk Web, the splunk CLI, or HTTP API. The configuration process is currently detailed here only for Splunk Web or the splunk CLI.
Use a browser to open the Splunk Web interface at http://localhost:8000.
Sign in with the username and password of your admin user.
Configure Splunk to receive and index both Vault audit device log and telemetry data into their corresponding HTTP Event Collectors and index types.
Add the metrics index to contain the telemetry metrics from Telegraf and Vault.
From the Splunk Web navigation menu, select Settings.
From under the Data menu, select Indexes.
Click New Index and you will encounter a dialog like the example shown here.
For Index Name, enter vault-metrics.
For Index Data Type, select Metrics.
Leave all other options at their default values.
Click Save
More information about creating metrics indexes is available in the Splunk documentation for creating metrics indexes.
After creating the indexes, you can proceed to creating the HTTP Event Collectors (HECs) for use by Fluentd and Telegraf for sending audit and metrics data from Vault to Splunk.
NOTE: To keep this example simple, it does not use SSL connections between the components in the stack. In a production deployment, however, you should use SSL for each component according to your requirements and use case.
Splunk enables SSL with a self-signed certificate by default, including for the HEC listeners. To disable SSL on all HEC listeners, access the HEC global settings.
From the Splunk Web navigation menu, select Settings.
Fluentd configuration requires that you install td-agent on your Vault servers. You must also install the Fluentd Splunk HEC plugin and configure td-agent by editing its configuration file.
Be sure to use the td-agent-gem command to install the fluent-plugin-splunk-enterprise plugin.
Once you have installed Fluentd and the Fluentd Splunk HEC plugin, use an editor to configure Fluentd.
Edit the td-agent.conf configuration file and add this example input source description for the Vault file audit device log.
<source>
  @type tail
  path /vault/logs/vault-audit.log
  pos_file /vault/logs/vault-audit-log.pos
  <parse>
    @type json
    time_format %iso8601
  </parse>
  tag vault_audit
</source>

<match vault_audit>
  @type splunk_hec
  host 10.10.42.100
  port 8088
  token 12b8a76f-3fa8-4d17-b67f-78d794f042fb
</match>
The following values need updates to match your environment.
Update these values under <source>.
path is the full path to your Vault audit device log file. Specify an existing file audit device log file here or leave the example as-is if you need to enable an audit device. Instructions for doing so are provided in the Configure Vault section.
pos_file is a similarly named file that Fluentd uses for recording file position.
NOTE: The user that your td-agent is executed as must have read permission to the file named in path and read & write permissions on the file named in pos_file.
Update these values under <match>.
host is the hostname or IP address of your Splunk server.
port is the configured Splunk HTTP Event Collector port number.
token is the HEC token value for the audit device HEC.
After configuring td-agent, start or restart the service as necessary.
$ systemctl restart td-agent
You can check the td-agent logs for signs of any issues. If you do not yet have a running Vault server, the logs will likely contain a repeated error about the missing audit device log file. This is expected and not a problem, as td-agent keeps retrying until it can read the file.
This completes the Fluentd configuration.
You are now ready to configure Telegraf for statsd compatible input from Vault and HTTP output to the Splunk HEC.
Telegraf can both act as a statsd-compatible agent and collect additional metrics of its own. Telegraf provides a range of input plugins to collect data from common sources.
Enable the most common plugins to monitor CPU, memory, disk I/O, networking, and process status in addition to the input and output plugins for Vault and Splunk respectively.
Here is a complete working example Telegraf configuration.
# Global tags relate to and are available for use in Splunk searches
# Of particular note are the index tag, which is required to match the
# configured metrics index name and the cluster tag which should match the
# value of Vault's cluster_name configuration option value.
[global_tags]
  index = "vault-metrics"
  datacenter = "us-east-1"
  role = "vault-server"
  cluster = "vtl"

# Agent options around collection interval, sizes, jitter and so on
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

# An input plugin that listens on UDP/8125 for statsd compatible telemetry
# messages using Datadog extensions which are emitted by Vault
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"
  metric_separator = "."
  datadog_extensions = true

# An output plugin that can transmit metrics over HTTP to Splunk
# You must specify a valid Splunk HEC token as the Authorization value
[[outputs.http]]
  url = "http://10.42.10.100:8088/services/collector"
  data_format = "splunkmetric"
  splunkmetric_hec_routing = true
  [outputs.http.headers]
    Content-Type = "application/json"
    Authorization = "Splunk 42c0ff33-c00l-7374-87bd-690ac97efc50"

# Read metrics about cpu usage using default configuration values
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

# Read metrics about memory usage
[[inputs.mem]]
  # No configuration required

# Read metrics about network interface usage
[[inputs.net]]
  # Uses default configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # No configuration required

# Read metrics about disk usage using default configuration values
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  ## mount_points = ["/"]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]
  # devices = ["sda", "sdb"]
  # skip_serial_number = false

[[inputs.kernel]]
  # No configuration required

[[inputs.linux_sysctl_fs]]
  # No configuration required

[[inputs.net]]
  # Specify an interface or all
  # interfaces = ["enp0s*"]

[[inputs.netstat]]
  # No configuration required

[[inputs.processes]]
  # No configuration required

[[inputs.procstat]]
  pattern = "(vault)"

[[inputs.system]]
  # No configuration required
The telegraf.conf file starts with global options.
The default collection interval is set to 10 seconds and a host tag is included in each metric.
As previously mentioned, Telegraf also allows you to set additional tags on the metrics that pass through it. In this case, you are adding tags for the cluster, index, datacenter, and role.
These tags can then be used in Splunk to filter queries (for example, to create a dashboard showing only servers with the vault-server role, or only servers in the us-east-1 datacenter).
TIP: A full reference to all the available statsd-related options in the Telegraf plugin is available in Telegraf Service Plugin: statsd.
Finally, there are inputs for items like CPU, memory, network I/O, and disk I/O. Most of them don't require any configuration, but make sure
the interfaces list in inputs.net matches the interface names you observe in ip addr or ifconfig output on the appropriate servers.
Exceptions to the default values used here are the global tag for the Splunk vault-metrics index, the input for Vault metrics, and the output plugin to send data to Splunk via HEC.
NOTE: You must set a valid value for the Authorization header defined under [outputs.http.headers] in the [[outputs.http]] plugin section. Replace the example value Splunk 42c0ff33-c00l-7374-87bd-690ac97efc50 with your actual HEC token value. Note that the 'Splunk ' prefix is part of the header value and must be kept in front of the token.
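As a rough illustration of the header value described in this NOTE, the following Python sketch builds the same two headers that the [outputs.http.headers] stanza sets. The function name hec_headers is hypothetical and not part of Telegraf or Splunk; the token shown is the placeholder from the example configuration, not a working credential.

```python
def hec_headers(token: str) -> dict:
    """Build the HTTP headers used when posting metrics to Splunk HEC.

    hec_headers is a hypothetical helper for illustration only.
    """
    token = token.strip()
    # The Authorization value must start with the literal "Splunk " scheme;
    # add it when only the bare token was supplied.
    if not token.startswith("Splunk "):
        token = "Splunk " + token
    return {
        "Content-Type": "application/json",
        "Authorization": token,
    }


if __name__ == "__main__":
    # Placeholder token from the example telegraf.conf, not a real credential.
    print(hec_headers("42c0ff33-c00l-7374-87bd-690ac97efc50"))
```

Whether you paste the bare token or the full value, the resulting Authorization header carries the prefix exactly once.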
Once you have configured your Telegraf installation, ensure that it is started. You can also check its log output to ensure there are no issues, and move on to configuring Vault for exporting telemetry metrics.
The cluster_name option is used at the global configuration scope and specifies a label for the Vault cluster. You can use a pre-existing value.
NOTE: The cluster_name value must match that of the cluster value in your Telegraf configuration [global_tags] stanza.
The options contained in the example telemetry stanza break down as follows.
dogstatsd_addr specifies that the statsd protocol-compatible listener (the function is provided by Telegraf) can be reached at the host localhost and port UDP/8125.
prometheus_retention_time is set to a retention time of 0 hours, which effectively disables the Prometheus metrics endpoint.
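If you want to confirm that the Telegraf statsd listener is reachable before relying on Vault's own metrics, a small Python sketch like the following can send one DogStatsD-formatted datagram to UDP/8125. The metric name vault.smoke_test and the helper names are assumptions made up for this illustration; Vault does not emit a metric with that name.

```python
import socket


def dogstatsd_line(name, value, metric_type="c", tags=None):
    """Format one DogStatsD datagram: name:value|type|#key:val,key:val"""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are the Datadog extension that Telegraf parses when
        # datadog_extensions = true is set on the statsd input.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send_test_metric(host="127.0.0.1", port=8125):
    """Send a single hypothetical counter to the local statsd listener."""
    payload = dogstatsd_line("vault.smoke_test", 1, "c", {"cluster": "vtl"})
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is fire-and-forget: no error is raised if nothing is listening.
        sock.sendto(payload.encode("utf-8"), (host, port))
    return payload


if __name__ == "__main__":
    print(send_test_metric())
```

Because UDP gives no delivery confirmation, check for the test metric in the Splunk metrics index or in the Telegraf log output rather than trusting the send call alone.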
Once you have Vault configured, you need to start or restart it as required.
After Vault is available for use and you are authenticated to it with a token that has sufficient capabilities to enable an audit device, proceed to enabling the file audit device if you will not be using an existing one.
NOTE: It is not currently possible to enable an audit device in the Vault web UI.
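Since the web UI cannot be used for this step, one option besides the CLI is the HTTP API. The following Python sketch only builds the request for the sys/audit endpoint and does not send it. The Vault address, token, and log file path are placeholder assumptions; substitute values from your own environment.

```python
import json
import urllib.request


def enable_file_audit_request(vault_addr, token, file_path, mount="file"):
    """Build (but do not send) a request that enables a file audit device.

    All argument values used below are placeholders for illustration.
    """
    body = json.dumps(
        {"type": "file", "options": {"file_path": file_path}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/audit/{mount}",
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = enable_file_audit_request(
        "http://127.0.0.1:8200",      # assumed local Vault address
        "s.example-token",            # placeholder token
        "/var/log/vault-audit.log",   # assumed audit log path
    )
    print(req.get_method(), req.full_url)
    # To actually enable the device: urllib.request.urlopen(req)
```

The token you supply needs the same capabilities on sys/audit that the CLI approach requires.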
Now that Vault is configured, you can begin to explore the audit device and metrics data in Splunk.
A Splunk App for Monitoring Vault, which consists of pre-built dashboards and reports, is available with Vault Enterprise. Without the app, you can still build your own dashboards from scratch, as all the data sources are available with all versions of Vault. The immediate topics below describe how you can explore metrics and search events. We also provide some example search queries.
You can explore the metric data in Splunk to learn more about what is available.
Click the Splunk Enterprise logo image to reach the Splunk Web home page.
Click Search & Reporting.
Click Analytics.
Click Metrics to drill into the metrics.
From here, you can explore all of the metrics exported by Telegraf, including Vault and system level metrics. By browsing here and finding metrics to chart, you can add them to dashboards for reuse.
Suppose you want to examine the time-to-live assigned to all tokens created in the past 24 hours. Access the search bar and use the vault.token.creation metric to obtain the total across all clusters in the index, as in this example.
| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics BY creation_ttl
You should observe something like the example screenshot as a result; note the count is 85 for tokens having a 2 hour TTL, and 201 for tokens having an infinite TTL in the example screenshot.
Selecting a different time range returns the total over that range. The data will be shown as an interactive table that can be sorted by any of the columns.
You can further augment this search to limit it to a particular cluster, a particular auth method, or a particular mount point, or further break down the query by any of these labels.
For example, to examine the mount points that are creating large numbers of long-lived tokens, use a search query like this example.
| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics AND cluster=<your-cluster> AND creation_ttl=+Inf BY mount_point
Be sure to change the example value <your-cluster> to the actual cluster_name of the Vault cluster for which you wish to search the metrics.
You should observe something like the example screenshot as a result; note the count is 201 for tokens issued from the auth/token endpoint in the example screenshot. If there were more auth methods enabled you could expect to also notice those listed here with their respective counts.
To get a time series instead, add span=30m to the end of the query to get one data point per 30 minutes.
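The variations described above (filtering by label, grouping with BY, and appending a span) follow a regular pattern, so they can be generated programmatically. The Python sketch below assembles such search strings; the helper name mstats_query is hypothetical, and the sketch only builds text that you would paste into the Splunk search bar.

```python
def mstats_query(metric, agg="sum", index="vault-metrics",
                 filters=None, by=None, span=None):
    """Assemble an mstats search string like the examples in this section.

    mstats_query is a hypothetical helper; it performs no quoting or
    validation of label values.
    """
    where = [f"index={index}"]
    for key, value in (filters or {}).items():
        where.append(f"{key}={value}")
    query = f"| mstats {agg}({metric}) AS count WHERE " + " AND ".join(where)
    if by:
        query += " BY " + ",".join(by)
    if span:
        # A span turns the single total into one data point per interval.
        query += f" span={span}"
    return query


if __name__ == "__main__":
    # Totals by TTL across all clusters in the index
    print(mstats_query("vault.token.creation.value", by=["creation_ttl"]))
    # Long-lived tokens per mount point for one cluster, as a time series
    print(mstats_query(
        "vault.token.creation.value",
        filters={"cluster": "vtl", "creation_ttl": "+Inf"},
        by=["mount_point"],
        span="30m",
    ))
```

The cluster value vtl matches the example Telegraf configuration; replace it with your own cluster_name.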
Vault Enterprise users can take advantage of a complete Splunk app built with care by the HashiCorp product and engineering teams. It includes powerful dashboards that split metrics into logical groupings targeted at both operators and business users of Vault. The Splunk application is available with Vault Enterprise Platform. However, all the data sources leveraged by the application are available with all versions of Vault.
Vault Enterprise users can complete the Splunk app request form to request access to the app.
The following are example dashboards and their metrics from the Vault Enterprise Splunk app along with some example queries that you can use with the App.
NOTE: Refer to this section for step by step instructions on configuring Splunk, Fluentd, Telegraf, and Vault to use the Splunk app.
Vault Operations Metrics from Telemetry Dashboard
The operations dashboard combines self-reported data from Vault with information from the Telegraf agent on each host. You can filter to a particular time range, limit your view to a particular cluster, and select any subset of the hosts to display.
The seal status of each Vault instance is shown, with sealed instances displayed first. Disk I/O, Network I/O, and CPU statistics are listed for each host that has been selected (by default, all on the same graph). Important thresholds or summary statistics are indicated by a dotted line.
The next pair of graphs shows request latency, as reported by Vault. Requests to secret engines and login requests are plotted separately, with 50th and 90th percentile shown.
The dashboard counts the total number of audit log failures during the selected time window; in a properly functioning system this should be zero. The time a leadership loss was reported appears here, as well as times of any leadership setup failure.
The next section reports on memory usage and lease count; these memory statistics and their importance are described above.
A source of a high number of leases can be investigated on the usage dashboard. A high rate of lease revocation may explain performance problems.
The final sections of the dashboard report on Vault’s replication and encryption mechanisms.
A larger than normal number of uncollected write-ahead-log entries may indicate a storage bottleneck. These entries are generated even if replication is not currently in use.
The barrier statistics show operations performed at Vault’s encryption barrier; the dashboard reports both the count of operations (left axis) and the average latency (right axis). Typically these numbers closely track the operations at the storage layer, but Vault may perform caching or batching to combine or eliminate storage operations.
This dashboard shows metrics about the integrated storage backend, or about the Consul storage backend. The top row shows measurement of operation latencies, in milliseconds, along with 50th and 90th percentiles.
For integrated storage, the next row shows the total count of operations by type, and statistics about storage entry sizes. The maximum size over the entire window is plotted as a dashed line; the blue line represents maximum within each time period, and a gray line represents mean entry size.
The last row shows host I/O metrics relevant to integrated storage: disk I/O throughput, disk usage, and network I/O for each host selected in the drop-down.
The usage metrics page gives information on tokens, secrets, leases and entities. The top panels let you examine the rate of token creation, and the number of tokens available for use in Vault.
Some combinations of filters are not available and will cause the corresponding panels to disappear. You can change the data series used for the time series plots by selecting "Namespace", "Auth methods", "Mount points", "Policies", or "Creation TTL".
For example, if you select "Mount points", then select "Auth method : github" the graphs will show information specific to any enabled GitHub auth methods.
The next section shows a count of the number of key-value secrets stored in Vault, and the rate of lease creation by mount point. These can be filtered by namespace, and the lease creation plot can be filtered by lease TTL or by secret engine.
The identity section gives information on which engines create entities, the total count, and the count by alias.
A final section shows the top 15 most common operations, by type and mount point, within the selected time window.
The “Quota Monitoring” dashboard plots the vault.quota.rate_limit.violation and vault.quota.lease_count.violation metrics that were introduced in Vault 1.5 as part of the resource quotas feature. The percentage utilization is plotted for each lease count quota; note that this metric is only emitted when a lease is allocated, not when it expires or is revoked.
“Where are high TTL tokens created?” can be used to identify the sources of long-TTL tokens. It displays the auth mount points that have created such tokens (via the vault.token.creation metric) and queries the audit log for additional information such as client IP address and username.
If you use the Splunk App and would like to customize these dashboards for your own environment, we recommend that you clone them rather than editing in-place.
You can do this from the Dashboards view: select Clone in the Actions drop-down menu to create a copy of the existing dashboard.
Because your customizations live in the copy, the original dashboards can still be updated when a new version of the Splunk App is available, and you will have the latest changes from it.
Note that these example queries will function only if you are using the Vault Enterprise Splunk app. If you encounter an error like the following example
Error in 'SearchParser': The search specifies a macro 'vault_audit_log' that cannot be found. Reasons include: the macro name is misspelled, you do not have "read" permission for the macro, or the macro has not been shared with this application. Click Settings, Advanced search, Search Macros to view macro information.
it means that you do not have the Vault Enterprise Splunk App installed and configured. Ensure that you install the App before proceeding with these examples.
The queries in the Splunk App (described below) use a macro to encapsulate the standard parts of a query, such as which index to search and what sourcetype to match. If you have the Splunk App installed, you may use the vault_telemetry macro to limit queries to metrics from Vault and Telegraf that appear in the vault-metrics index.
Another example use case is understanding the number of tokens with policy "admin" in use in the system. As this is a gauge metric, you cannot simply sum all the values as in the case above. Instead, you need to take the most recent value of the metric, as in this example.
| mstats latest(vault.token.count.by_policy) AS count WHERE `vault_telemetry` AND policy=admin BY cluster,namespace earliest=-30m
One subtlety is that the "latest" for a particular combination of labels is not necessarily the same point in time. If a particular policy has stopped being used in namespace "ns1" for example, then a "latest" query matching that namespace may return any point during the time window.
Vault multi-dimensional use metrics do not report all possible zero values, because this would create undue load on the metrics collection. The easiest way to handle this time skew is to limit how far back in time the query may go.
Another pitfall to avoid is that in Splunk, metrics are not automatically summed across all possible label sets. So if you query for the latest token count gauge matching a policy, that gauge will represent only one cluster and one namespace.
Use the BY or WHERE clauses with the entire set of available labels, then sum the resulting values explicitly. For example, you can modify the query above to return all policy counts across namespaces, like this.
| mstats latest(vault.token.count.by_policy) AS count WHERE `vault_telemetry` BY cluster,namespace,policy earliest=-30m | stats sum(count) AS count BY policy
| sort -count
The resulting table counts the number of active tokens by policy, summed over all the clusters and namespaces where a token with a matching policy name appears. As before, you can turn this query into a time series by replacing earliest=-30m with span=30m to see the total within each half-hour window.
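The two-step logic of that query (take the latest gauge value for each full label set, then sum explicitly) can be sketched in plain Python. The sample data points below are invented for illustration and do not come from a real Vault cluster; the function name latest_then_sum is hypothetical.

```python
from collections import defaultdict


def latest_then_sum(samples):
    """Mimic `mstats latest(...) BY cluster,namespace,policy | stats sum BY policy`.

    samples: iterable of (timestamp, cluster, namespace, policy, value).
    """
    latest = {}
    for ts, cluster, namespace, policy, value in samples:
        key = (cluster, namespace, policy)
        # Keep only the most recent gauge reading for each label set.
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)

    totals = defaultdict(int)
    for (_cluster, _namespace, policy), (_ts, value) in latest.items():
        # Gauges are summed across label sets, never across time.
        totals[policy] += value

    # Equivalent of `| sort -count`
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented sample readings: (timestamp, cluster, namespace, policy, value)
    samples = [
        (100, "vtl", "root", "admin", 3),
        (110, "vtl", "root", "admin", 4),   # supersedes the reading at t=100
        (105, "vtl", "ns1", "admin", 2),    # latest for ns1 is older than root's
        (110, "vtl", "root", "default", 9),
    ]
    print(latest_then_sum(samples))
```

Note how the admin total combines readings taken at different times (t=110 and t=105), which is the time skew that limiting the search window is meant to bound.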
Here are a couple of examples that use only the audit device log data. The following shows all login events, displaying the token display name, entity ID, and the policies that were attached to the token.
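The same selection can be approximated outside Splunk by reading the file audit device output directly. The Python sketch below assumes each line is a JSON object whose response entries carry the issued token details under response.auth, and that login requests use a path beginning with auth/; treat that field layout as an assumption and verify it against your own audit log. The sample line is invented.

```python
import json


def login_events(lines):
    """Yield (display_name, entity_id, policies) for responses that issued a token.

    Assumes the audit entry layout described in the lead-in; not a
    replacement for the Splunk search.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("type") != "response":
            continue
        path = (entry.get("request") or {}).get("path", "")
        auth = (entry.get("response") or {}).get("auth")
        # Assumption: a login is a response under auth/ that returned auth data.
        if auth and path.startswith("auth/"):
            yield (
                auth.get("display_name"),
                auth.get("entity_id"),
                auth.get("policies", []),
            )


if __name__ == "__main__":
    # Invented sample entry for illustration only
    sample = json.dumps({
        "type": "response",
        "request": {"path": "auth/userpass/login/alice"},
        "response": {"auth": {
            "display_name": "userpass-alice",
            "entity_id": "example-entity-id",
            "policies": ["default", "admin"],
        }},
    })
    for event in login_events([sample]):
        print(event)
```

In practice you would pass an open file handle for the audit log in place of the one-element list.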
You learned about two sources of Vault operational and usage data in the form of telemetry metrics and audit device logs. You also got to know some of the critical usage and operational metrics along with how to use them in a specific monitoring and graphing stack.
You also learned about a solution consisting of Fluentd, Telegraf, and Splunk for analyzing and monitoring Vault, and were given the opportunity to try it in a comprehensive online tutorial.
Finally, you learned about the Vault Enterprise Splunk app and some of the comprehensive graphs that it includes.