Vault 1.2 introduced an internal storage backend, Integrated Storage, as a technical preview alongside the supported external storage types. (Integrated Storage became generally available in Vault 1.4.) With Integrated Storage, data is replicated to all the nodes in the cluster using the Raft consensus protocol. Managing the nodes in the cluster, however, was a manual process.
Vault 1.7 introduced autopilot to simplify and automate the cluster management for Integrated Storage. The autopilot includes:
- Cluster node health check
- Server stabilization: prevent disruption to raft quorum due to an unstable new node
- Monitor newly added node health for a period and decide promotion to voter status
- Dead server cleanup - periodic, automatic clean-up of failed servers
Autopilot is enabled by default upon upgrading to Vault 1.7. Server stabilization works by default, but you must enable dead server cleanup explicitly, as described in the Autopilot configuration section.
»Prerequisites
This tutorial requires Vault, sudo access, and additional configuration to create the cluster.
»Scenario setup
To demonstrate the autopilot feature, you will start 6 Vault instances, each listening on a different port as shown in the diagram below.
- vault_1 (http://127.0.0.1:8100) is initialized and unsealed. The root token creates a transit key that enables auto-unseal for the other Vault servers. This Vault server is not a part of the cluster.
- vault_2 (http://127.0.0.1:8200) is initialized and unsealed. This Vault starts as the cluster leader. An example K/V-V2 secret is created.
- vault_3 (http://127.0.0.1:8300) is started and automatically joins the cluster via retry_join.
- vault_4 (http://127.0.0.1:8400) is started and automatically joins the cluster via retry_join.
- vault_5 (http://127.0.0.1:8500) is started and automatically joins the cluster via retry_join.
- vault_6 (http://127.0.0.1:8600) is started and automatically joins the cluster via retry_join.
If this is your first time setting up a Vault cluster with integrated storage, go through the Vault HA Cluster with Integrated Storage tutorial.
Retrieve the configuration by cloning or downloading the hashicorp/vault-guides repository from GitHub.
This repository contains supporting content for all of the Vault learn tutorials. The content specific to this tutorial can be found within a sub-directory.
Change the working directory to vault-guides/operations/raft-autopilot/local.
Set the run_all.sh file to executable.
Execute the run_all.sh script to spin up a Vault cluster with 5 nodes.
You can find the server configuration files and the log files in the working directory.
Verify the cluster.
vault_2 is the leader.
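A minimal sketch of the verification step, assuming the CLI should talk to vault_2 at the address given in the scenario and that VAULT_TOKEN already holds a token permitted to read the raft state. The verify_cluster function name is hypothetical; the list-peers subcommand is part of the Vault CLI.

```shell
# Point the CLI at vault_2 (address taken from the scenario description).
export VAULT_ADDR="http://127.0.0.1:8200"

# Assumption: VAULT_TOKEN is already exported with sufficient privileges.
verify_cluster() {
  # Lists every raft peer with its address, state, and voter status.
  vault operator raft list-peers
}
```

Call verify_cluster after the setup script finishes; the State column should report vault_2 as the leader.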
»Understand the autopilot behavior
View the help message for the vault operator raft autopilot command.
Display the current cluster status.
This displays the overall health of the cluster, and its failure tolerance.
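The two steps above might look like the following sketch. The wrapper function names are hypothetical, and the default address assumes the CLI targets vault_2 as in the scenario.

```shell
# Default to vault_2 unless VAULT_ADDR is already set in the environment.
: "${VAULT_ADDR:=http://127.0.0.1:8200}"
export VAULT_ADDR

# Show the available autopilot subcommands.
autopilot_help() {
  vault operator raft autopilot -help
}

# Show cluster health, failure tolerance, leader, and per-server state.
autopilot_state() {
  vault operator raft autopilot state
}
```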
The current leader node is vault_2. The Failure Tolerance is 2; therefore, you can lose up to 2 nodes and still maintain quorum. The Healthy parameter value is true for all nodes in the cluster.
Refer to the deployment table for the quorum size and failure tolerance for various cluster sizes.
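The relationship between cluster size, quorum, and failure tolerance can be sketched with simple arithmetic: a Raft quorum is a strict majority of the voters, and the failure tolerance is whatever remains. The function names below are illustrative and are not part of the Vault CLI.

```shell
# Quorum is a strict majority of the voting servers: floor(n / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Failure tolerance is the number of voters that can fail while a quorum remains.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the values for a few common cluster sizes.
for n in 3 4 5 6 7; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerance=$(failure_tolerance "$n")"
done
```

For the 5-node cluster in this tutorial the quorum is 3, which matches the Failure Tolerance of 2 reported by autopilot.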
»Stop one of the nodes
Set the cluster.sh file to executable.
Stop vault_6.
Optional: You can verify that vault_6 is not running.
Check the cluster health.
Notice that the Healthy state of the cluster is now false, and the Failure Tolerance is 1. The Healthy state of vault_6 is false; therefore, you know which node failed.
Although vault_6 is no longer running, it is still a cluster member at this point.
»Autopilot configuration
Check the autopilot settings to see the default behavior.
Parameter | Description |
---|---|
Cleanup Dead Servers (bool) | Specifies automatic removal of dead server nodes periodically. |
Last Contact Threshold (string) | Limit the amount of time a server can go without leader contact before being considered unhealthy. |
Dead Server Last Contact Threshold (string) | Limit the amount of time a server can go without leader contact before being considered failed. |
Server Stabilization Time (string) | Minimum amount of time a server must be stable in the 'healthy' state before being added to the cluster. |
Min Quorum (int) | Minimum number of servers allowed in a cluster before autopilot can prune dead servers. |
Max Trailing Logs (int) | Maximum number of log entries in the Raft log that a server can be behind its leader before being considered unhealthy. |
Check the current autopilot configuration.
The Cleanup Dead Servers parameter is set to false.
Update the autopilot configuration to enable the dead server cleanup. For demonstration, set the Dead Server Last Contact Threshold to 10 seconds, and the Server Stabilization Time to 30 seconds.
Verify the configuration.
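The update and verification could be sketched as follows. The threshold and stabilization values come from the text above; the -min-quorum=3 value is an assumption added here, chosen so that autopilot does not prune the cluster below three servers. The function name is hypothetical.

```shell
# Enable dead server cleanup, then print the resulting configuration.
# Assumption: -min-quorum=3 is not stated in the text; it keeps autopilot
# from pruning the cluster below three servers.
enable_dead_server_cleanup() {
  vault operator raft autopilot set-config \
    -cleanup-dead-servers=true \
    -dead-server-last-contact-threshold=10s \
    -server-stabilization-time=30s \
    -min-quorum=3 &&
  vault operator raft autopilot get-config
}
```

After running enable_dead_server_cleanup, the Cleanup Dead Servers parameter in the printed configuration should read true.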
Check the cluster health.
The cluster's Healthy parameter value is back to true. Notice that vault_6 is no longer listed. The Voters parameter lists vault_2 through vault_5.
Check the cluster peers to double-check.
»Add a new node to the cluster
Explore how the autopilot configuration settings influence the cluster when you add a new node.
Add a new node (vault_7) to the cluster.
List the cluster members.
Notice that the vault_7 server is a non-voter. (The Voter parameter value is false.)
Check the cluster health.
The vault_7 server joins the cluster as a non-voter until the Server Stabilization Time of 30 seconds elapses.
Wait for 30 seconds and check the cluster peers.
Now, the vault_7 server should be a voter. This is a part of the server stabilization mechanism of the autopilot.
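Instead of waiting a fixed 30 seconds, the promotion can be polled. The sketch below assumes the default table output of list-peers, with the columns Node, Address, State, and Voter in that order; the function names, the 60-second default timeout, and the 5-second poll interval are all illustrative choices.

```shell
# Print the Voter column for one node, parsed from the list-peers table.
# Assumption: the fourth whitespace-separated column is the Voter flag.
peer_is_voter() {
  vault operator raft list-peers | awk -v node="$1" '$1 == node { print $4 }'
}

# Poll until the node reports Voter=true or the timeout (in seconds) expires.
wait_for_voter() {
  node="$1"
  timeout="${2:-60}"
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(peer_is_voter "$node")" = "true" ]; then
      echo "$node is a voter after about ${waited}s"
      return 0
    fi
    sleep 5
    waited=$(( waited + 5 ))
  done
  echo "$node is still a non-voter after ${timeout}s" >&2
  return 1
}
```

For example, wait_for_voter vault_7 returns successfully once autopilot promotes the node, which should happen shortly after the stabilization time elapses.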
Vault Enterprise: The explicit non-voter nodes behave the same way as before and remain non-voters as designed. If the dead server cleanup is enabled, it will prune failed non-voters.
»Configure the state check interval
By default, the autopilot picks up any state change at an interval of 10 seconds.
To change the default, set the autopilot_reconcile_interval parameter inside the storage stanza in the server configuration file.
Example: The following server configuration file sets the autopilot to pick up state changes at an interval of 15 seconds.
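A minimal sketch of such a configuration, written out from the shell in the same way the tutorial scripts generate their config files. The output file name, the path value, and the node_id value are placeholders rather than values from this tutorial; only the autopilot_reconcile_interval setting is the point of the example.

```shell
# Hypothetical output path; override CONFIG_FILE to write elsewhere.
CONFIG_FILE="${CONFIG_FILE:-./config-autopilot-example.hcl}"

# Write a storage stanza that reconciles autopilot state every 15 seconds.
cat > "$CONFIG_FILE" <<'EOF'
storage "raft" {
  # Placeholder values; use the data path and node ID of your own server.
  path    = "/path/to/raft/data"
  node_id = "vault_2"

  # Reconcile autopilot state every 15 seconds instead of the default 10.
  autopilot_reconcile_interval = "15s"
}
EOF
```

The server reads this setting from its configuration file, so the new interval applies the next time the server starts with that file.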
»Clean up
The cluster.sh script provides a clean operation that removes all services, configuration, and modifications to your local system.
Clean up your local workstation.
»Help and reference
- Integrated Storage Internal documentation
- Integrated Storage Concepts documentation
- Commands (CLI) - operator raft autopilot
- API docs - sys/storage/raft/autopilot
- Vault HA Cluster with Integrated Storage tutorial
- Preflight Checklist - Migrating to Integrated Storage
»Vault Enterprise Replication
If you are running Vault Enterprise with replication enabled, read the Replication section in the Autopilot documentation for additional information.
The following tutorials walk through the Enterprise Replication setup: