Vault Reference Architecture

The goal of this document is to recommend HashiCorp Vault deployment practices. This reference architecture conveys a general architecture that should be adapted to accommodate the specific needs of each implementation.

The following topics are addressed in this guide:

  • Glossary
  • Design Summary
  • Network Connectivity Details
  • Failure Tolerance
  • Single Region Deployment (Enterprise)
  • Multiple Region Deployment (Enterprise)
  • Best Case Architecture
  • Vault Replication (Enterprise Only)
  • Deployment System Requirements
  • Load Balancing

Glossary

Vault Cluster

A Vault cluster is a set of Vault processes that together run a Vault service. These Vault processes could be running on physical or virtual servers, or in containers.

Consul storage backend cluster

HashiCorp recommends and supports Consul being used as the storage backend for Vault. A Consul cluster is a set of Consul server processes that together run a Consul service. These Consul processes could be running on physical or virtual servers, or in containers.

Availability Zone

A single failure domain at the location level that hosts part of, or all of, a Vault cluster. Round-trip latency between availability zones should be less than 8 ms. A single Vault cluster may be spread across multiple availability zones.

Examples of an availability zone in this context are:

  • An isolated datacenter
  • An isolated cage in a datacenter, if it is isolated from other cages by all other means (power, network, etc.)
  • An availability zone in AWS, Azure or GCP

Region

A geographically separate collection of one or more availability zones. A region would host one or more Vault clusters. There is no defined maximum latency requirement between regions in Vault architecture. A single Vault cluster would not be spread across multiple regions.

Design Summary

This design is the recommended architecture for production environments, as it provides flexibility and resilience.

A key architectural recommendation is that the Consul servers be separate from the Vault servers, and that the Consul cluster be used only as a storage backend for Vault and not for other Consul-focused functionality (e.g. service segmentation and service discovery), which can introduce unpredictable resource utilization. Separating Vault and Consul allows each to have a system that can be sized appropriately in terms of CPU, memory and disk. Consul is a memory-intensive application, so separating it onto its own resources is advantageous to prevent resource contention or starvation. Dedicating a Consul cluster as a Vault storage backend is also advantageous because the Consul cluster then only needs to be upgraded as required to improve Vault storage backend functionality, which is likely to be much less frequent than for a Consul cluster that is also used for service discovery and service segmentation.

Vault to Consul backend connectivity is over HTTP and should be secured with TLS to encrypt all traffic, as well as a Consul ACL token to control access. See the Vault Deployment Guide for more information. As the Consul cluster for Vault storage may be used in addition to, and separate from, a Consul cluster for service discovery, it is recommended that the storage Consul process be run on non-default ports so that it does not conflict with other Consul functionality. Setting the Consul storage cluster to run on 7xxx ports and using this as the storage port in the Vault configuration will achieve this. It is also recommended that Consul be run using TLS.
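
As a minimal sketch, and assuming the Consul storage cluster's HTTPS API has been moved to a 7xxx port as suggested above, the storage stanza in the Vault server configuration might look like the following. The port, path and token values are placeholders, not prescribed values.

storage "consul" {
  address = "127.0.0.1:7501"       # local Consul client agent on an assumed 7xxx HTTPS port
  scheme  = "https"                # encrypt Vault-to-Consul traffic with TLS
  path    = "vault/"               # key prefix under which Vault data is stored
  token   = "<consul-acl-token>"   # Consul ACL token granting Vault access to its paths
}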

Network Connectivity Details

The following table outlines the network traffic requirements for Vault cluster nodes.

Source                     | Destination    | Port | Protocol    | Direction     | Purpose
---------------------------|----------------|------|-------------|---------------|-----------------------------------------------
Consul clients and servers | Consul servers | 7300 | TCP         | incoming      | Server RPC
Consul clients             | Consul clients | 7301 | TCP and UDP | bidirectional | LAN gossip communications
Vault clients              | Vault servers  | 8200 | TCP         | incoming      | Vault API
Vault servers              | Vault servers  | 8201 | TCP         | bidirectional | Vault replication traffic, request forwarding
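
On the Consul side, moving the storage cluster onto the 7xxx ports referenced above and in the table is done through the agent's ports stanza. The following is a sketch only; the exact port values are assumptions for illustration.

ports {
  server   = 7300   # server RPC
  serf_lan = 7301   # LAN gossip (TCP and UDP)
  https    = 7501   # HTTPS API that Vault's storage stanza points at (assumed value)
  http     = -1     # optionally disable the plaintext HTTP API
}
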
Alternative Network Configurations

Vault can be configured in several ways for communication between the Vault and Consul clusters:

  • Using host IP addresses or hostnames that are resolvable via a standard name resolution subsystem
  • Using load balancer IP addresses or hostnames that are resolvable via a standard name resolution subsystem
  • Using the attached Consul cluster DNS as service discovery to resolve Vault endpoints
  • Using a separate Consul service discovery cluster DNS as service discovery to resolve Vault endpoints

All of these options are explored more in the Vault Deployment Guide.
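
As a small illustration of the first two options, the Consul address in Vault's storage stanza can be a resolvable hostname or a load balancer name rather than an IP address. The names and port below are assumptions, and only one form would be used at a time.

# Resolvable hostname of a Consul client agent or server (assumed name)
storage "consul" {
  address = "consul-storage.example.internal:7501"
  scheme  = "https"
}

# Or a load balancer fronting the Consul storage cluster (assumed name)
# address = "consul-storage-lb.example.internal:7501"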

Failure Tolerance

Vault is designed to handle different failure scenarios that have different probabilities. When deploying a Vault cluster, the failure tolerance you require should be considered and designed for. In OSS Vault the recommended number of instances in a cluster is 3, as any more would have limited value. In Vault Enterprise the recommended number is also 3 per cluster, but more can be used if they are performance standbys to help with read-heavy workloads. The Consul cluster should have between one and seven instances, and this should be an odd number so that leadership elections can always resolve. It is recommended that the Consul cluster has at least five instances that are dedicated to performing backend storage functions for the Vault cluster only.

Node

The Vault and Consul cluster software allows for a failure domain at the node level by having replication within the cluster. In a single HA Vault cluster, all nodes share the same underlying storage backend and therefore the same data. Vault achieves this by having one of the Vault servers obtain a lock within the data store to become the active Vault node, which has write access. If at any time the leader is lost, another Vault node seamlessly takes its place as the leader. To achieve n-2 redundancy (where the loss of 2 objects within the failure domain can be tolerated), an ideal size for a Vault cluster is 3. Consul achieves replication and leadership through the use of its consensus and gossip protocols. In these protocols, a leader is elected by consensus, so a quorum of active servers must always exist. To achieve n-2 redundancy, an ideal size for a Consul cluster is 5. See Consul Internals for more details.

Availability Zone

Typical distribution in a cloud environment is to spread Consul/Vault nodes across separate Availability Zones (AZs) within a high-bandwidth, low-latency network, such as an AWS Region. However, this may not be possible in a datacenter installation where there is only one DC within the required latency.

It is important to understand a change in requirements or best practices that has come about as a result of the move towards greater utilization of highly distributed systems such as Consul. When operating environments comprised of distributed systems, a shift is required in the redundancy coefficient of the underlying components. Consul relies upon consensus negotiation to organize and replicate information, so the environment must provide 3 unique resilient paths in order to provide meaningful reliability. Essentially, a consensus system requires a simple majority of nodes to be available at any time. In the example of 3 nodes, you must have 2 available. If those 3 nodes are placed in two failure domains, there is a 50% chance that losing a single failure domain would result in a complete outage.

Region

To protect against a failure at the region level, as well as provide additional geographic scaling capabilities, Vault Enterprise offers:

  • Disaster Recovery Replication
  • Performance Replication

Please see the Recommended Patterns on Vault Replication for a full description of these options.

Because of the constraints listed above, the recommended architecture is Vault and Consul Enterprise distributed across three availability zones within a cluster, with clusters replicated across regions using DR and Performance Replication. There are also several “Best Case” architecture solutions for one and two Availability Zones, and for Consul OSS. These are not the recommended architecture, but are the best solutions if your deployment is restricted by Consul version or number of availability zones.

The architecture below is the recommended best approach to Vault deployment and should be the target architecture for any installation. This is split into two parts:

  • Vault cluster - This is the recommended architecture for a Vault cluster as a single entity, but it should also use replication as per the second diagram
  • Vault replication - This is the recommended architecture for multiple Vault clusters, allowing for regional distribution, performance replication and disaster recovery.

Single Region Deployment (Enterprise)

Reference Diagram

In this scenario, the nodes in the Vault and associated Consul clusters are hosted across three Availability Zones. This solution has n-2 at the node level for Vault and n-3 at the node level for Consul. At the Availability Zone level, Vault is at n-2 and Consul at n-1. This differs from the OSS design in that the Consul cluster has six nodes, three of them as non-voting members. The Consul cluster is set up using Redundancy Zones so that if any node were to fail, a non-voting member would be promoted by Autopilot to become a full member and so maintain quorum.
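
The Redundancy Zone behaviour described here is driven by Consul Enterprise Autopilot. The following is a sketch of the relevant server agent settings; the zone label is an assumption and would normally match the availability zone the server runs in.

node_meta {
  zone = "us-east-1a"            # availability zone of this Consul server (assumed label)
}

autopilot {
  redundancy_zone_tag = "zone"   # one voter per zone; the others remain non-voting members
}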

Multiple Region Deployment (Enterprise)

Reference Diagram Resilience against Region Failure

In this scenario, the clusters are replicated to guard against a full region failure. There are three Performance Replica Vault clusters (clusters A, B, C) each with its own DR cluster (clusters D, E, F) in a different Region. Each cluster has its associated Consul cluster for storage backend.

This architecture allows for n-2 at the region level provided all secrets and secret engines are replicated across all clusters. Failure of the full Region 1 would require DR cluster F to be promoted. Once this was done, the Vault solution would be fully functional, with some loss of redundancy until Region 1 was restored. Applications would not have to re-authenticate, as the DR cluster for each failed cluster contains all leases and tokens.

Reference Diagram Resilience against Cluster Failure

This solution provides full resilience at the cluster level, but does not guard against region failure, as the DR clusters for the Performance Replicas are in the same region. There are certain use cases where this is the preferred approach, such as when data cannot be replicated to other regions due to governance restrictions like GDPR. Some infrastructure frameworks may also not have the ability to route application traffic to different regions.

Best Case Architecture

In some deployments there may be insurmountable restrictions that mean the recommended architecture is not possible. This could be due to a lack of availability zones or the use of Vault OSS. In these cases, the architectures below detail the best case options available.

Note that in the following architectures, the Consul leader can be any of the five Consul server nodes and the Vault active node can be any of the three Vault nodes.

Deployment of Vault in one Availability Zone (all)

Reference Diagram

In this scenario, all nodes in the Vault and associated Consul cluster are hosted within one Availability Zone. This solution has a single point of failure at the Availability Zone level, but n-2 at the node level for both Consul and Vault. This is not the HashiCorp recommended architecture for production systems, as there is no redundancy at the Availability Zone level. There is also no DR capability, so at a minimum this deployment should have a DR replica in a separate region.

Deployment of Vault in two Availability Zones (OSS)

Reference Diagram

In this scenario, the nodes in the Vault and associated Consul cluster are hosted across two Availability Zones. This solution has n-2 at the node level for Vault and Consul and n-1 for Vault at the Availability Zone level, but the addition of an Availability Zone does not significantly increase the availability of the Consul cluster. This is because the Raft protocol requires a quorum of (n/2)+1, and if Zone B were to fail in the above diagram, the cluster would not be quorate and so would also fail. This is not the HashiCorp recommended architecture for production systems, as there is only partial redundancy at the Availability Zone level and an Availability Zone failure may or may not result in an outage.

Deployment of Vault in two Availability Zones (Enterprise)

Reference Diagram

Due to the need to maintain quorum in the Consul cluster, having only two Availability Zones is not ideal. There is no way to spread a Consul cluster across two AZs with any guarantee of added resiliency. The best case solution in Vault Enterprise is to treat the two AZs as regions and have separate Vault clusters in each.

The secondary Vault cluster can either be a Performance Replica or a DR Replica, each having its own advantages:

  • PR secondary: If the Vault address is managed by Consul or by a load balancer, then a failure of either cluster results in traffic being directed to the other cluster with no outage, provided there is logic in your application or in the Vault Agent (see the sketch after this list) to manage re-requesting tokens that are not replicated between the clusters.
  • DR secondary: In this case the failure of the primary cluster requires operator intervention to promote the DR to primary, but as all leases and tokens are replicated, there is no need for any additional logic in the application to handle this.
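
For the PR secondary case, the token re-request logic can be delegated to the Vault Agent instead of being built into each application. The following is a minimal auto-auth sketch, assuming an AppRole auth method; the address and file paths are placeholders for your environment.

# Vault Agent configuration sketch: re-authenticates automatically, so a new
# token is obtained after traffic moves to the other Performance Replica.
vault {
  address = "https://active.vault.service.consul:8200"   # assumed Consul-managed address
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault-agent/role-id"
      secret_id_file_path = "/etc/vault-agent/secret-id"
    }
  }

  sink "file" {
    config = {
      path = "/etc/vault-agent/token"   # applications read the current token from here
    }
  }
}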

Deployment of Vault in three Availability Zones (OSS)

Reference Diagram

In this scenario, the nodes in the Vault and associated Consul cluster are hosted across three Availability Zones. This solution has n-2 at the node level for Vault and Consul and n-2 for Vault at the Availability Zone level. It also has n-1 at the Availability Zone level for Consul and as such is considered the most resilient of all architectures for a single Vault cluster with a Consul storage backend in the OSS product.

Vault Replication (Enterprise Only)

In these architectures the Vault cluster is illustrated as a single entity and would be one of the single clusters detailed above, based on your number of Availability Zones. Running multiple Vault clusters as a single Vault solution with replication between them is available in Vault Enterprise only. OSS Vault can be set up in multiple clusters, but each would be an individual Vault solution with no support for replication between clusters.

The Vault documentation provides more detailed information on the replication capabilities within Vault Enterprise.

Performance Replication

Vault performance replication allows for secrets management across many sites. Static secrets, authentication methods, and authorization policies are replicated so they are active and available in multiple locations; however, leases, tokens, and dynamic secrets are not.

Disaster Recovery Replication

Vault disaster recovery replication ensures that a standby Vault cluster is kept synchronized with an active Vault cluster. This mode of replication includes data such as ephemeral authentication tokens, time-based token information, and token usage data. This provides for an aggressive recovery point objective in environments where preventing loss of ephemeral operational data is of the utmost concern. In any enterprise solution, Disaster Recovery Replicas are considered essential.

Corruption or Sabotage Disaster Recovery

Another common scenario to protect against, more prevalent in cloud environments that provide very high levels of intrinsic resiliency, is the purposeful or accidental corruption of data and configuration, and/or a loss of cloud account control. Vault's DR Replication is designed to replicate live data, which would propagate intentional or accidental data corruption or deletion. To protect against these possibilities, you should back up Vault's storage backend. This is supported through the Consul Snapshot feature, which can be automated for regular archival backups. A cold site or new infrastructure could be rehydrated from a Consul snapshot.

Replication Notes

There is no set limit on the number of clusters within a replication set; the largest deployments today are in the 30+ cluster range. A Performance Replica cluster can have a Disaster Recovery cluster associated with it and can also replicate to multiple Disaster Recovery clusters.

While a Vault cluster can possess a replication role (or roles), there are no special considerations required in terms of infrastructure, and clusters can be promoted or demoted between roles. Special circumstances related to mount filters and HSM usage may limit the swapping of roles, but those are based on specific organizational configurations.

Using replication with Vault clusters integrated with HSM (or cloud auto-unseal) devices for automated unseal operations has some details that should be understood during the planning phase:

  • If a performance primary cluster uses an HSM, all other clusters within that replication set must use an HSM as well.
  • If a performance primary cluster does NOT use an HSM (uses Shamir secret sharing method), the clusters within that replication set can be mixed, such that some may use an HSM, others may use Shamir.
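
As an example of the first rule, a performance primary configured for auto-unseal with AWS KMS implies a seal stanza such as the sketch below on every cluster in the replication set; the region and key ID are placeholders.

seal "awskms" {
  region     = "us-east-1"      # assumed region
  kms_key_id = "<kms-key-id>"   # placeholder KMS key used for auto-unseal
}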

Reference Diagram

Deployment System Requirements

The following table provides guidelines for server sizing. Of particular note is the strong recommendation to avoid non-fixed performance CPUs, or Burstable CPU in AWS terms, such as T-series instances.

Sizing for Vault Servers

Size  | CPU      | Memory       | Disk  | Typical Cloud Instance Types
------|----------|--------------|-------|------------------------------------------------------------------------------------------------------
Small | 2 core   | 4-8 GB RAM   | 25 GB | AWS: m5.large; Azure: Standard_D2_v3; GCE: n1-standard-2, n1-standard-4
Large | 4-8 core | 16-32 GB RAM | 50 GB | AWS: m5.xlarge, m5.2xlarge; Azure: Standard_D4_v3, Standard_D8_v3; GCE: n1-standard-8, n1-standard-16

Sizing for Consul Servers

Size  | CPU      | Memory        | Disk   | Typical Cloud Instance Types
------|----------|---------------|--------|------------------------------------------------------------------------------------------------------
Small | 2 core   | 8-16 GB RAM   | 50 GB  | AWS: m5.large, m5.xlarge; Azure: Standard_D2_v3, Standard_D4_v3; GCE: n1-standard-4, n1-standard-8
Large | 4-8 core | 32-64+ GB RAM | 100 GB | AWS: m5.2xlarge, m5.4xlarge; Azure: Standard_D4_v3, Standard_D8_v3; GCE: n1-standard-16, n1-standard-32

Hardware Considerations

The small size category would be appropriate for most initial production deployments, or for development/testing environments.

The large size is for production environments where there is a consistent high workload. That might be a large number of transactions, a large number of secrets, or a combination of the two.

In general, processing requirements will depend on the encryption workload and messaging workload (operations per second, and types of operations). Memory requirements will depend on the total size of secrets/keys stored in memory and should be sized according to that data (as should the hard drive storage). Vault itself has minimal storage requirements, but the underlying storage backend should have a relatively high-performance hard disk subsystem. If many secrets are being generated or rotated frequently, this information will need to be flushed to disk often and can impact performance if slower hard drives are used.

The Consul servers' function in this deployment is to serve as the storage backend for Vault. This means that all content stored for persistence in Vault is encrypted by Vault and written to the storage backend at rest. This data is written to the key-value store section of Consul's Service Catalog, which must be stored in its entirety in memory on each Consul server. This means that memory can be a constraint on scaling as more clients authenticate to Vault, more secrets are persistently stored in Vault, and more temporary secrets are leased from Vault. It also means that the Consul servers must be scaled vertically on memory if additional space is required, as the entire Service Catalog is stored in memory on each Consul server.

Furthermore, network throughput is a common consideration for Vault and Consul servers. As both systems are HTTPS API driven, all incoming requests, communications between Vault and Consul, underlying gossip communication between Consul cluster members, communications with external systems (per auth or secret engine configuration, and some audit logging configurations) and responses consume network bandwidth.

Due to network performance considerations in Consul cluster operations, replication of Vault datasets across network boundaries should be achieved through Performance or DR Replication, rather than spreading the Consul cluster across network and physical boundaries. If a single Consul cluster is spread across network segments that are distant or inter-regional, it can cause synchronization issues within the cluster or incur additional data transfer charges with some cloud providers.

Other Considerations

Vault Production Hardening Recommendations provides guidance on best practices for a production hardened deployment of Vault.

Load Balancing

Load Balancing Using Consul Interface

Consul can provide load balancing capabilities through service discovery, but it requires that Vault clients are Consul-aware. This means that a client can use either the Consul DNS or API interfaces to resolve the active Vault node. A client might access Vault via a URL like the following: http://active.vault.service.consul:8200

This relies upon the operating system's DNS resolution system, and the request can be forwarded to Consul for the actual IP address response. The operation can be completely transparent to legacy applications and operates just like a typical DNS resolution. See Consul DNS forwarding for more information.
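
Because Vault registers itself in the Consul catalog when Consul is used as its storage backend, the main cluster-side requirement is that the Consul agents answer DNS queries. A sketch of the relevant Consul agent settings is below; the upstream resolver address is an assumption, and forwarding *.consul queries from the operating system resolver to this port is covered in the Consul DNS forwarding documentation.

ports {
  dns = 8600                 # Consul's DNS interface (default port)
}
recursors = ["10.0.0.2"]     # assumed upstream resolver for non-.consul lookups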

Load Balancing Using External Load Balancer

Vault Behind a Load Balancer

External load balancers are supported as an entry point to a Vault cluster. The external load balancer should poll the sys/health endpoint to detect the active node and route traffic accordingly. The load balancer should be configured to make an HTTP request to the following URL on each node in the cluster: http://<Vault Node URL>:8200/v1/sys/health

The active Vault node will respond with a 200 while the standby nodes will return a 4xx response.

The following is a sample configuration block from HAProxy to illustrate:

listen vault
    bind 0.0.0.0:80
    balance roundrobin
    option httpchk GET /v1/sys/health
    server vault1 192.168.33.10:8200 check
    server vault2 192.168.33.11:8200 check
    server vault3 192.168.33.12:8200 check

Note that the above block could be generated by Consul (with consul-template) when a software load balancer is used. This could be the case when the load balancer is software like Nginx, HAProxy, or Apache.

Example Consul Template for the above HAProxy block:

listen vault
   bind 0.0.0.0:8200
   balance roundrobin
   option httpchk GET /v1/sys/health{{range service "vault"}}
   server {{.Node}} {{.Address}}:{{.Port}} check{{end}}
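
The template above would typically be rendered by a consul-template process that rewrites the HAProxy configuration and reloads the load balancer whenever the Vault service catalog changes. A sketch of that runner configuration follows; the file paths and reload command are assumptions.

consul {
  address = "127.0.0.1:8500"                   # local Consul client agent
}

template {
  source      = "/etc/haproxy/haproxy.ctmpl"   # file containing the template shown above
  destination = "/etc/haproxy/haproxy.cfg"
  command     = "systemctl reload haproxy"     # run after the rendered file changes
}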

Client IP Address Handling

There are two supported methods for handling client IP addressing behind a proxy or load balancer: X-Forwarded-For headers and PROXY v1. Both require a trusted load balancer and IP address whitelisting to adhere to security best practices.
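
Both behaviours are configured on Vault's TCP listener. The sketch below shows the X-Forwarded-For variant, with the PROXY protocol alternative commented out; the trusted CIDR and certificate paths are placeholders.

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"

  # Only trust X-Forwarded-For headers arriving from the load balancer subnet
  x_forwarded_for_authorized_addrs = ["10.0.1.0/24"]

  # Alternative: accept PROXY protocol v1 from the load balancer instead
  # proxy_protocol_behavior         = "use_always"
  # proxy_protocol_authorized_addrs = ["10.0.1.0/24"]
}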

Additional References

  • Vault architecture documentation explains each Vault component
  • To integrate Vault with an existing LDAP server, refer to the LDAP Auth Method documentation
  • Refer to the AppRole Pull Authentication guide to programmatically generate a token for a machine or app
  • Consul is an integral part of running a resilient Vault cluster, regardless of location. Refer to the online Consul documentation to learn more.

Next steps

  • Read Production Hardening to learn best practices for a production hardened deployment of Vault.
  • Read Deployment Guide to learn the steps required to install and configure a single HashiCorp Vault cluster.