Day 1: Deploying Vault on Kubernetes

Vault on Kubernetes Reference Architecture

This document outlines a reference architecture for the deployment of HashiCorp Vault on the Kubernetes cluster scheduler. Those interested in deploying a Vault service consistent with these recommendations should read the upcoming Vault on Kubernetes Deployment Guide, which will include instructions on the usage of the official HashiCorp Vault Helm Chart.

It should be noted that, though this document currently only covers Vault Open Source, the only HashiCorp-supported configuration of the persistence layer/secure storage for Vault Enterprise is via HashiCorp Consul Enterprise. Readers may therefore want to refer to the non-Kubernetes Consul Reference Architecture and Consul Deployment Guide as a general reference. The recommendations in this doc related to Consul deployment are heavily informed by those documents.

The following topics are addressed in this guide:

Kubernetes cluster features & configuration

Federation and cluster-level HA

This document details designs for a resilient, reliable, and highly-available Vault OSS service via effective use of availability zones and other forms of in-region datacenter redundancy. Future versions of this document will include designs optimized for Vault Enterprise Disaster Recovery Replication and multi-datacenter Performance Replication. There is no expectation that your Kubernetes cluster has been configured for Kubernetes-specific forms of multi-datacenter redundancy such as Federation v2 or other third-party tools for improving Kubernetes reliability and disaster recovery. Future updates to the Reference Architecture may take these other technologies into account.

Secure scheduling via RBAC and NodeRestrictions

This document details various cluster scheduling constructs used to ensure the proper spread of Vault and Consul Pods amongst a pool of Nodes in multiple availability zones. The same constructs also ensure, for security purposes, that the Vault and Consul Pods do not share a Node with non-Vault and non-Consul Pods. These constructs rely on Kubernetes Node Labels. Historically, the kubelets running on Nodes have been given privileges to modify their own Node labels and sometimes even the labels of other Nodes. This opens the possibility of rogue operators or workloads modifying Node Labels in such a way as to subvert the isolation of Consul and Vault workloads from other workloads on the cluster. For this reason the use of Kubernetes RBAC and the NodeRestriction admission plugin is required, though not covered in this doc.

More details may be found in the Vault on Kubernetes Deployment Guide.
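
As an illustration only, the NodeRestriction admission plugin is typically enabled via a kube-apiserver flag. The fragment below is a minimal sketch assuming a self-managed control plane where the kube-apiserver static Pod manifest can be edited; most managed Kubernetes offerings enable this plugin by default.

...
  spec:
    containers:
      - name: kube-apiserver
        command:
          - kube-apiserver
          # NodeRestriction limits each kubelet to modifying only its own
          # Node object and the Pods bound to it
          - --enable-admission-plugins=NodeRestriction
          # ... remaining kube-apiserver flags ...
...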

Network-attached storage volumes

For the purposes of this Reference Architecture, the Consul Pods have a mandatory requirement of durable storage via PersistentVolumes and PersistentVolumeClaims. It is also strongly encouraged, bordering on a hard requirement, that those volumes be network-attached and capable of being re-bound to new Pods should the original Pods holding the volume claim go offline due to permanent Node failure. Although it is possible to deploy the Reference Architecture using PersistentVolumes which are not capable of being re-bound to replacement Pods (ex: hostPath), doing so significantly reduces the effectiveness of deploying across multiple availability zones for both Consul and Vault and is thus not recommended.

Examples of network-attached storage which would meet the above requirements include AWS EBS, GCE PD, Azure Disk, and Cinder volumes. Please see the Persistent Volumes documentation for more details.
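
As a sketch only (the names, volume type, and sizes below are illustrative assumptions for an AWS-based cluster), a StorageClass backed by EBS plus a matching PersistentVolumeClaim might look like the following:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: consul-ebs                  # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# Delay binding until a Pod is scheduled so the volume is provisioned in the
# same availability zone as the consuming Pod.
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-consul-0               # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: consul-ebs
  resources:
    requests:
      storage: 25Gi                 # matches the "Small" Consul sizing below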

Infrastructure requirements

Sizing of Kubernetes control plane nodes

The Kubernetes community generally recommends against running non-administrative workloads on control plane/master nodes. Most Kubernetes cluster installers and cloud-hosted Kubernetes clusters disallow scheduling general workloads on the control plane. Even if your cluster allows it, Vault and Consul Pods should not be scheduled on the control plane. As general workloads, neither Vault nor Consul place unusual demands on the control plane relative to other general workloads. For these reasons control plane node sizing is considered outside of the scope of this document and thus no specific recommendations are offered.

Dedicated Nodes/kubelets

Vault Pods should be scheduled to a set of Nodes/kubelets on which no other workloads can be scheduled. This prevents the possibility of co-tenant rogue workloads attempting to penetrate protections provided by the Node operating system and container runtimes to gain access to Vault kernel-locked memory or Consul memory and persistent storage volumes. See below for details.

Sizing of Kubernetes Nodes (kubelets)

The suggested hardware requirements for kubelets hosting Consul and Vault do not vary substantially from the recommendations made in the non-Kubernetes Reference Architecture documents for Consul and Vault. Canonical sizing information can be found here:

The sizing tables as specified at the time of this writing have been reproduced below for convenience:

Sizing for Consul servers:

Size    CPU       Memory         Disk    Typical Cloud Instance Types
Small   2 core    4-8 GB RAM     25 GB   AWS: m5.large; Azure: Standard_D2_v3; GCE: n1-standard-2, n1-standard-4
Large   4-8 core  16-32 GB RAM   50 GB   AWS: m5.xlarge, m5.2xlarge; Azure: Standard_D4_v3; GCE: n1-standard-8, n1-standard-16

Sizing for Vault servers:

Size    CPU       Memory          Disk     Typical Cloud Instance Types
Small   2 core    8-16 GB RAM     50 GB    AWS: m5.large; Azure: Standard_D2_v3; GCE: n1-standard-4, n1-standard-8
Large   4-8 core  32-64+ GB RAM   100 GB   AWS: m5.2xlarge, m5.4xlarge; Azure: Standard_D4_v3, Standard_D8_v3; GCE: n1-standard-16, n1-standard-32

Infrastructure Design

Baseline Node layout

The following diagram represents the initial configuration of Kubernetes Nodes without application of any of the constructs that the following sections will leverage for scheduling our Consul and Vault Pods. Although Nodes at the bottom of the diagram are set off visually from Nodes at the top of the diagram, at this point they represent identical configurations. Note there are three availability zones: Availability Zone 0, Availability Zone 1, and Availability Zone 2.

This is the baseline configuration upon which following sections of this doc will build.

img

Consul Server Pods and Vault Server Pods

Limiting our Consul Server Pods and Vault Server Pods to a subset of Nodes

Working from the non-configured baseline in the previous diagram, the set of Nodes must first be partitioned into those where Consul Server Pods and Vault Server Pods will run and those which are available for other non-Vault-related workloads. The Kubernetes constructs of Node Labels and Node Selectors are utilized to notify the scheduler on which Nodes we'd like Consul and Vault workloads to land. Later sections of this document will discuss how to enforce the requirement that this same set of Nodes is dedicated for use by Consul and Vault workloads.

  • Recent versions of Kubernetes will often auto-label Nodes with a set of built-in labels using metadata from the hosting cloud provider. If the cloud provider does not support auto-labeling, these labels can be manually populated. Built-in labels in the diagram below are in black text.

  • It's worth noting that 'failure-domain.beta.kubernetes.io/zone' has special meaning within Kubernetes when used as a topologyKey: during scheduling Kubernetes will best-effort spread Pods evenly amongst the specified zones. This doc shows a generic 'az0', 'az1', etc. but, as an example, in AWS this might look like 'ca-central-1a' or 'ap-south-1b'.

  • In this doc Nodes are assumed to have been provisioned with unique hostnames and thus the built-in label 'kubernetes.io/hostname' can be used, again via topologyKey, to best-effort spread Consul Server Pods and Vault Server Pods across our selected Nodes.

  • Nodes can also be provisioned with custom labels. In the diagram below our custom label denoting a Node is reserved for use by a Vault workload, vault_in_k8s=true, is in blue text. Nodes without that label will not be used for Vault-related workloads and are included in the diagram only to emphasize this point.

img

In the diagram above nine nodes have been labeled with vault_in_k8s: true. This k/v pair is referenced by our nodeSelector to inform Kubernetes where to place our Consul and Vault Pods.

Example:

...
  nodeSelector:
    vault_in_k8s: "true"
...
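
For reference, the metadata of one of the dedicated Nodes might then look roughly like the following (the Node name and zone value are illustrative):

apiVersion: v1
kind: Node
metadata:
  name: node-az0-0                                 # illustrative hostname
  labels:
    # built-in labels (auto-populated by the cloud provider or set manually)
    kubernetes.io/hostname: node-az0-0
    failure-domain.beta.kubernetes.io/zone: az0
    # custom label reserving the Node for Vault-related workloads
    vault_in_k8s: "true"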

The k/v pair and nodeSelector are necessary but not sufficient for our requirements:

  1. There is no guarantee that a single Node will have only a Consul Server Pod or only a Vault Server Pod exclusively.
  2. There is no guarantee that the Consul Server Pods and Vault Server Pods will be distributed evenly amongst our availability zones.
  3. There is no guarantee that untrusted Pods will not be scheduled onto Nodes where Consul and Vault are running.

In the next section podAntiAffinity scheduling will resolve #1 and #2 above.

Spread Consul Server Pods and Vault Server Pods across Availability Zones and Nodes

In the previous section a Node Selector was used to request that Consul Server Pods and Vault Server Pods run on a select subset of the available Nodes. The next requirement is that Pods be evenly distributed amongst the availability zones and amongst the selected Nodes. Pods are spread across the availability zones to limit exposure to problems in a particular availability zone. Consul Server Pods and Vault Server Pods must run on separate Nodes to limit exposure to extreme resource pressure in either service. For example, if the Vault service is suffering from unusually high k/v write requests, no single Node will ever be required to handle both the resulting Vault k/v load and the resulting Consul k/v load. Ensuring the Pods are never co-tenant also makes rolling upgrades of Consul Pods, Vault Pods, and Kubernetes itself less error-prone and more reliable.

The podAntiAffinity scheduling construct, together with an appropriate topologyKey, is used to ensure the two dimensions of spread mentioned above.

As mentioned in the previous section, spread amongst availability zones is achieved via the 'failure-domain.beta.kubernetes.io/zone' key, while spread amongst the Nodes is achieved via the 'kubernetes.io/hostname' key.

Example Pod spec for Consul Server Pod:

...
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # keep Consul Server Pods on separate Nodes
        - labelSelector:
            matchLabels:
              app: consul
              release: <release name>
              component: server
          topologyKey: kubernetes.io/hostname
        # spread Consul Server Pods amongst availability zones
        - labelSelector:
            matchLabels:
              app: consul
              release: <release name>
              component: server
          topologyKey: failure-domain.beta.kubernetes.io/zone
...

Example Pod spec for Vault Server Pod:

...
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # keep Vault Server Pods on separate Nodes from each other
        - labelSelector:
            matchLabels:
              app: vault
              release: <release name>
              component: server
          topologyKey: kubernetes.io/hostname
        # keep Vault Server Pods off Nodes hosting Consul Server Pods
        - labelSelector:
            matchLabels:
              app: consul
              release: <release name>
              component: server
          topologyKey: kubernetes.io/hostname
        # a zone-spread term analogous to the Consul example above
        # (topologyKey: failure-domain.beta.kubernetes.io/zone) may also be
        # added to explicitly spread Vault Server Pods amongst availability zones
...

With the above configs Kubernetes will ensure best-effort spread of Consul Server Pods and Vault Server Pods amongst both AZs and Nodes. Ensuring Node-level isolation of Consul and Vault workloads from general workloads via Taints and Tolerations is covered in the next section.

In the previous section we leveraged Kubernetes anti-affinity scheduling to ensure a desired spread of Consul Server Pods and Vault Server Pods along the axes of Availability Zones and Nodes (identified by unique hostname). In this section the Kubernetes constructs of Taints and Tolerations ensure that Vault-related workloads never share a Node with non-Vault-related workloads. As mentioned in previous sections, Node-level isolation is a safeguard to prevent rogue workloads from penetrating a Node's OS-level and container runtime-level protections for a possible attack on Vault's kernel-locked memory, Consul process memory, and Consul persistent storage.

Taints

First, the labeled Nodes dedicated for use by Consul Server Pods and Vault Server Pods must be tainted to prevent general workloads from running on them. They are tainted 'NoExecute' so that any running Pods will be removed from the Nodes before we place our intended Pods. The diagram below shows our partitioned nodes with a newly applied taint called taint_for_consul_xor_vault=true:NoExecute. The taint is shown in blue for emphasis.
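
Expressed as part of a Node spec, that taint might look roughly like the fragment below (a sketch; in practice the taint is commonly applied with 'kubectl taint' or by the Node provisioning tooling):

...
  spec:
    taints:
      - key: "taint_for_consul_xor_vault"
        value: "true"
        effect: "NoExecute"
...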

img

Tolerations

Once a Taint has been placed on a Node, a Pod spec must include a matching Toleration if a Pod is to run on that Node.

Example Pod spec (Consul and Vault) with Toleration:

...
  spec:
    tolerations:
      - key: "taint_for_consul_xor_vault"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
...

The diagrams below show the various scheduling configurations of our Consul Server Pods and Vault Server Pods to this point. Some things to note:

  1. Node Selector and Tolerations are shown in yellow text for emphasis.
  2. The diagram uses a custom syntax to denote anti-affinity rules as the actual syntax is too verbose to easily fit onto the diagram.

Consul Server Pod:

img

Vault Server Pod:

img

Consul Client Pods

Previous sections focused on proper scheduler configuration for Consul Server Pods and Vault Server Pods. There's an additional Pod type which will be part of the infrastructure: Consul Client Pods. The Consul Client Pod is used by the Vault Server Pod to find the Consul cluster which will be used for Vault secure storage. Consul Client Pods will be scheduled onto each Node which has been dedicated to Vault-related workload hosting. Strictly speaking, there's no requirement that a Node hosting a Consul Server Pod also have a Consul Client Pod, but for simplicity's sake a Consul Client Pod is co-scheduled everywhere a Vault Server Pod might be scheduled. This simplification comes at negligible additional resource cost for the Node.

Many of the same scheduling constructs already used will also be leveraged for Consul Client Pods, though with much less complexity. The Consul Client Pods are scheduled as a DaemonSet, but the DaemonSet is limited to the partitioned Nodes via use of a Node Selector and a Toleration.

Node Selector

As with the Consul Server Pod and Vault Server Pod, the Node Selector is part of the Pod spec and is quite simple:

...
  nodeSelector:
    vault_in_k8s: "true"
...

This ensures that Pods from the DaemonSet can only be deployed onto Nodes labeled with vault_in_k8s=true. Remember, however, that the dedicated Nodes have a Taint applied. The DaemonSet must include a Toleration.

Tolerations

As with the Consul Server Pods and Vault Server Pods, a Toleration is specified for taint_for_consul_xor_vault=true:NoExecute.

Example DaemonSet Pod spec with Toleration:

...
  spec:
    tolerations:
      - key: "taint_for_consul_xor_vault"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
...
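
Putting the Node Selector and Toleration together, a trimmed DaemonSet spec for the Consul Client Pods might look roughly like the following sketch (names and image version are illustrative; the official Helm Chart renders the complete manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: consul-client               # illustrative name
spec:
  selector:
    matchLabels:
      app: consul
      component: client
  template:
    metadata:
      labels:
        app: consul
        component: client
    spec:
      # run only on Nodes dedicated to Vault-related workloads
      nodeSelector:
        vault_in_k8s: "true"
      # tolerate the taint which keeps general workloads off those Nodes
      tolerations:
        - key: "taint_for_consul_xor_vault"
          operator: "Equal"
          value: "true"
          effect: "NoExecute"
      containers:
        - name: consul
          image: consul:1.4.4       # illustrative version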

Deployed Infrastructure

All the required Labels, Node Selectors, Taints, Tolerations, and anti-affinity scheduling rules are now specified. The diagrams below show the resulting infrastructure, including Nodes and Pod placement.

img

Note that this results in a bit of extra capacity amongst the dedicated Nodes. There is a Node in Availability Zone 2 with a Consul Client but neither a Consul Server Pod nor a Vault Server Pod. That's by design. Remember that Kubernetes is doing a best-effort spread of Pods amongst the AZs and available Nodes. In the case of a Node failure that spare Node is available for re-scheduling of Pods previously on the failed Node.

Exposing the Vault Service

With the resilient Vault service available in the Kubernetes cluster, the next question becomes how to expose that Service to Vault Clients running outside of the Kubernetes cluster. There are three common constructs for exposing Kubernetes Services to external/off-cluster clients: Load Balancer Services, Node Port Services, and Ingress.

Communication between Vault Clients and Vault Servers depends on the request path, client address, and TLS certificates, and thus only Load Balancer and Node Port, both Layer 4 proxies, are recommended at this time. Future versions of this document may include details on using Ingress. Both Load Balancer and Node Port require setting externalTrafficPolicy to 'Local' to preserve the Vault Client source addresses embedded in Vault client requests and responses.
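
As a sketch only (the Service name, selector labels, and port are illustrative; 8200 is Vault's default API port), a Load Balancer Service with client address preservation might look like this:

apiVersion: v1
kind: Service
metadata:
  name: vault                       # illustrative name
spec:
  type: LoadBalancer
  # 'Local' routes external traffic only to Nodes hosting a ready Vault Pod
  # and preserves the client source address
  externalTrafficPolicy: Local
  selector:
    app: vault
    component: server
  ports:
    - name: api
      port: 8200
      targetPort: 8200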

Help and Reference