Operations

[Enterprise] Monitoring Vault Replication

The Disaster Recovery Replication Setup and Performance Replication Setup guides walked through the steps to configure Vault replication.

[Figure: Deployment Topology]

This guide focuses on monitoring and troubleshooting the replication setup.

The following topics are addressed in this guide:

  • Replication Status Check
  • Port Traffic Consideration with Load Balancer
  • WAL Replays
  • Reindex
  • Key Monitoring Metrics
  • Vault Configuration Consideration
  • Update Replication Primary

» Replication Status Check

The status endpoint provides information on WAL streaming and the Merkle tree, which are the basic elements of Vault's replication system. In short, WAL streaming is used for normal, ongoing replication, whereas the Merkle tree and Merkle syncs are used for recovery when replication is too far out of sync to use WALs. For more detail on how these work, see Replication Implementation Details.

Let's assume that you have cluster 1 set up as the performance replication primary, with cluster 3 as its disaster recovery (DR) site. Cluster 2 is set up as the performance replication secondary, with cluster 4 as its DR site.

[Figure: Cluster Relationship]

To verify the replication status of those clusters, use the sys/replication/status endpoint.

# Via API
$ curl -s $VAULT_ADDR/v1/sys/replication/status | jq

# Via CLI
$ vault read -format=json sys/replication/status

Please Note: This will output the performance replication status under "performance" and the disaster recovery replication status under "dr".
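
If you only care about a few fields for routine monitoring, you can filter the output with jq. The following is just a convenience sketch; it assumes jq is installed on the host:

$ vault read -format=json sys/replication/status \
    | jq '.data | {dr_mode: .dr.mode, dr_state: .dr.state, performance_mode: .performance.mode, performance_state: .performance.state}'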

» Parameters to check

» On the Primary:

  • cluster_id: Unique ID for this set of replicas. This value should always match on the Primary and Secondary.
  • known_secondaries: List of the IDs of all non-revoked secondary activation tokens created by this Primary. The ID will be listed regardless of whether or not the token was used to activate an actual secondary cluster.
  • last_wal: Index of the last Write-Ahead Log (WAL) entry written to the local storage. Note that WALs are created for every write to storage.
  • merkle_root: A snapshot in time of the merkle tree's root hash. The merkle_root changes on every update to storage.
  • mode: This should be "primary".
  • primary_cluster_addr: If you set a primary_cluster_addr when enabling replication, it will appear here. If you did not explicitly set this, this field will be blank on the primary. As such, a blank field here can be completely normal.
  • state: This value should be "running" on the primary. If the value is "idle", it indicates an issue that needs to be investigated.

» On the Secondary:

  • cluster_id: Unique ID for this set of replicas. This value should always match on the Primary and Secondary.
  • known_primary_cluster_addrs: List of cluster_addr values from each of the nodes in the Primary's cluster. This list is updated approximately every 5 seconds and is used by the Secondary to know how to communicate with the Primary in the event of a Primary node's active leader changing.
  • last_remote_wal: The last WAL index that the secondary received from the primary via WAL streaming.
  • merkle_root: A snapshot in time of the merkle tree's root hash. The merkle_root changes on every update to storage.
  • mode: This should be "secondary".
  • primary_cluster_addr: This records the very first address that the secondary uses to communicate with the Primary after replication is enabled. It may not reflect the current address being used (see known_primary_cluster_addrs).
  • secondary_id: The ID of the secondary activation token used to enable replication on this secondary cluster.
  • state:
    • stream-wals: Indicates normal streaming. This is the value you want to see.
    • merkle-diff: Indicates that the cluster is determining the sync status to see if a merkle sync is required in order for the secondary to catch up to the primary.
    • merkle-sync: Indicates that the cluster is syncing. This happens when the secondary is too far behind the primary to use the normal stream-wals state for catching up. This state is blocking.
    • idle: Indicates an issue. You need to investigate. (A quick state check is sketched after this list.)
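
As a quick sanity check on a secondary, you can pull just the state field and flag anything other than stream-wals. This is a minimal sketch against the performance replication status endpoint (use sys/replication/dr/status on a DR secondary); adapt the alerting to your environment:

# Minimal state check on a performance secondary (assumes jq is installed)
$ state=$(vault read -format=json sys/replication/performance/status | jq -r '.data.state')
$ if [ "$state" != "stream-wals" ]; then echo "WARNING: replication state is '$state'" >&2; fi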

» Are my DR Clusters in Sync?

It is relatively straightforward to tell whether DR clusters are in sync, since all data is replicated.

When the clusters are fully in sync, you can expect to see:

  • state of secondary will be stream-wals
  • last_remote_wal on the secondary should match (or be very close to) the last_wal on the primary (a quick comparison is sketched after this list)
  • Generally, the merkle_root on the primary and secondary will match
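
A minimal sketch of such a comparison is shown below. The VAULT_PRIMARY_ADDR and VAULT_DR_ADDR variables are assumptions standing in for the addresses of your DR primary and DR secondary:

# Compare WAL indexes across the DR pair (assumes jq is installed)
$ primary_wal=$(curl -s $VAULT_PRIMARY_ADDR/v1/sys/replication/dr/status | jq -r '.data.last_wal')
$ secondary_wal=$(curl -s $VAULT_DR_ADDR/v1/sys/replication/dr/status | jq -r '.data.last_remote_wal')
$ echo "primary last_wal=$primary_wal  secondary last_remote_wal=$secondary_wal"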

» Are my Performance Clusters in Sync?

Performance replication only replicates a subset of data, so checking to make sure the clusters are in sync is a bit more involved.

Most importantly, the last_wal and last_remote_wal values are NOT always the same for performance replication. This is because last_wal on the primary tracks all of its WALs, not only the data being replicated. (Remember that not everything gets replicated to the secondary; tokens, for example, are not.) However, the two values should match momentarily after you write a replicated piece of data to the cluster, since the last WAL written on the primary will be that piece of data and the last_remote_wal that the secondary sees via WAL streaming will be the same WAL.

When the clusters are fully in sync, you can expect to see:

  • state of secondary will be stream-wals
  • Right after writing a piece of replicated data, the last_remote_wal on the secondary should match the last_wal on the primary for a short period of time (see the sketch after this list).
  • Generally, the merkle_root on the primary and secondary will match. Keep in mind that the merkle_root changes at every update.
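
One way to exercise this is to write a throwaway secret to the performance primary (KV data is replicated) and then read the WAL indexes from both clusters right away. The secret path and the VAULT_PRIMARY_ADDR / VAULT_PERF_SECONDARY_ADDR addresses below are examples only:

# On the performance primary: write a replicated piece of data
$ vault kv put secret/replication-test check="$(date +%s)"

# Compare the WAL indexes shortly afterwards
$ curl -s $VAULT_PRIMARY_ADDR/v1/sys/replication/performance/status | jq '.data.last_wal'
$ curl -s $VAULT_PERF_SECONDARY_ADDR/v1/sys/replication/performance/status | jq '.data.last_remote_wal'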

» Important Note

  • When you first set up replication, the last_remote_wal on the secondary will be 0 until a new piece of replicated information is written and replicated. The reason is that the initial sync of data when you first enable replication on a secondary is done via Merkle sync rather than via WAL streaming. Therefore, it does not have any remote WALs until after it bootstraps, enters stream-wals mode, and then receives a new piece of replicated data from the primary via WAL streaming.

» Example - Cluster 1

Cluster 1 is both a Performance and a DR primary; therefore, mode is set to "primary" and state is set to "running" for both dr and performance replication. The Performance and DR outputs for this one node have separate cluster_ids and merkle_roots, since each form of replication works independently of the other.

$ vault read -format=json sys/replication/status

{
  ...

  "data": {
    "dr": {
      "cluster_id": "68561173-3a72-e79c-56ba-14ab1da6f26f",
      "known_secondaries": [
        "dr_secondary"
      ],
      "last_wal": 333,
      "merkle_root": "a56eb5f6e01abe9834fa083fe5974564923a1f88",
      "mode": "primary",
      "primary_cluster_addr": "",
      "state": "running"
    },
    "performance": {
      "cluster_id": "f2c8e03c-88ba-d1e5-fd3d-7b327671b4cc",
      "known_secondaries": [
        "pr_secondary"
      ],
      "last_wal": 303,
      "merkle_root": "4632976f88df33c89598ba42a57f1418090fcfc8",
      "mode": "primary",
      "primary_cluster_addr": "",
      "state": "running"
    }
  },
  "warnings": null
}

» Example - Cluster 2

Cluster 2 is a DR primary cluster and a Performance secondary cluster; therefore, its mode for dr is set to "primary"; however, for performance replication, it's set to "secondary". Notice that the secondary displays last_remote_wal rather than last_wal and its state is stream-wals.

The secondary's merkle_root should generally match the merkle_root of its primary. If that is off, you can write a test piece of replicated data to the performance primary and then check to make sure the last_wal on the primary matches (or almost matches) the last_remote_wal on the secondary right afterwards.

$ vault read -format=json sys/replication/status

{
  ...

  "data": {
    "dr": {
      "cluster_id": "559369e0-4897-013a-ed7c-2b817969c643",
      "known_secondaries": [
        "dr_secondary"
      ],
      "last_wal": 920,
      "merkle_root": "5b643128fa4d1bd2f0f28913a6581981436728d9",
      "mode": "primary",
      "primary_cluster_addr": "",
      "state": "running"
    },
    "performance": {
      "cluster_id": "f2c8e03c-88ba-d1e5-fd3d-7b327671b4cc",
      "known_primary_cluster_addrs": [
        "https://primary.example.com:8201"
      ],
      "last_remote_wal": 303,
      "merkle_root": "4632976f88df33c89598ba42a57f1418090fcfc8",
      "mode": "secondary",
      "primary_cluster_addr": "https://primary.example.com:8201",
      "secondary_id": "pr_secondary",
      "state": "stream-wals"
    }
  },
  "warnings": null
}

You can also see the replication status in the Web UI:

[Figure: Deployment Topology]

» Example - Cluster 3

Cluster 3 is a DR secondary of cluster 1; therefore, its list of known_primary_cluster_addrs should include the cluster address of each node in cluster 1 (or a load balancer that will direct it there).

The DR secondary must be kept synchronized with its DR primary. The last_wal of the primary and the last_remote_wal of the secondary should be the same (or nearly the same, allowing for a little latency introduced by the network).

$ vault read -format=json sys/replication/status

{
  ...

  "data": {
    "dr": {
      "cluster_id": "68561173-3a72-e79c-56ba-14ab1da6f26f",
      "known_primary_cluster_addrs": [
        "https://primary.example.com:8201"
      ],
      "last_remote_wal": 333,
      "merkle_root": "a56eb5f6e01abe9834fa083fe5974564923a1f88",
      "mode": "secondary",
      "primary_cluster_addr": "https://primary.example.com:8201",
      "secondary_id": "dr_secondary",
      "state": "stream-wals"
    },
    "performance": {
      "mode": "disabled"
    }
  },
  "warnings": null
}

» Example - Cluster 4

Cluster 4 is a DR secondary of cluster 2; as in the example for cluster 3, the last_wal of the primary and the last_remote_wal of the secondary should be the same (or nearly the same, allowing for a little latency introduced by the network).

$ vault read -format=json sys/replication/status

{
  ...

  "data": {
    "dr": {
      "cluster_id": "559369e0-4897-013a-ed7c-2b817969c643",
      "known_primary_cluster_addrs": [
        "https://perf-secondary.example.com:8201"
      ],
      "last_remote_wal": 920,
      "merkle_root": "5b643128fa4d1bd2f0f28913a6581981436728d9",
      "mode": "secondary",
      "primary_cluster_addr": "https://perf-secondary.example.com:8201",
      "secondary_id": "dr_secondary",
      "state": "stream-wals"
    },
    "performance": {
      "mode": "disabled"
    }
  },
  "warnings": null
}

» Port Traffic Consideration with Load Balancer

Vault generates its own certificates for cluster members. After initial bootstrapping, all replication traffic flows over the cluster port and is secured with these Vault-generated certificates. Because of this, TLS for cluster traffic can NOT be terminated at the cluster port at the load balancer level.

If your platform of choice is AWS, use a Classic Load Balancer (ELB) or Network Load Balancer (NLB) rather than an Application Load Balancer (ALB). An ALB terminates the connection and does not support TCP pass-through; therefore, if you try to use an ALB for replication traffic, you will run into issues.

If needed, set the primary_cluster_addr to override the cluster address value for replication when you enable the primary.
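
For example, a sketch of enabling performance replication on the primary with an explicit override might look like the following; the load balancer hostname is an example only:

$ vault write sys/replication/performance/primary/enable \
    primary_cluster_addr="https://vault-primary-lb.example.com:8201"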

» WAL Replays

Write-Ahead Logs (WALs) are replayed at startup as well as during a reindex. At startup, the WAL replay completely blocks incoming requests (no reads or writes).

During a reindex, the behavior depends on the Vault version:

  • Version 0.11.2 and later: When the reindex is triggered manually via the /sys/replication/reindex endpoint, the very last WAL replay of the reindex blocks writes, but reads are still allowed. The rest of the reindex is non-blocking. If a reindex occurs at startup (e.g. if index records were lost or manually deleted from the underlying storage), the WAL replay blocks incoming requests (no reads or writes) until the reindex completes.
  • Version 0.11.1 and earlier: The entire reindex is fully blocking of all operations (no reads or writes).

» Reindex

To keep the replicated data consistent across the primary and secondary, Vault maintains Merkle trees. If replication is in a bad state or data has been removed from the storage backend without Vault's knowledge, you can trigger reindexing of the Merkle tree via the /sys/replication/reindex endpoint. The reindex process rebuilds the potentially out-of-sync Merkle tree from the underlying storage so that the tree reflects the correct state of all encrypted secrets.

This is a powerful tool in a situation where unintentional loss of index records occurred.

NOTE: The time it takes to reindex depends on the number and size of objects in the data store.
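
To trigger a reindex manually, send a write to the reindex endpoint. A sketch of both forms is shown below; run it against the active node with a token that has sufficient (sudo) privileges:

# Via CLI
$ vault write -f sys/replication/reindex

# Via API
$ curl -s --header "X-Vault-Token: $VAULT_TOKEN" --request POST \
    $VAULT_ADDR/v1/sys/replication/reindex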

» Key Monitoring Metrics

This section explains some of the key metrics to look for that are specific to Vault Replication.

Your Vault configuration must define the telemetry stanza in order to collect telemetry. For example:

...
telemetry {
  dogstatsd_addr = "localhost:8125"
  disable_hostname = true
}

It is important to collect the storage backend (Consul) telemetry to monitor the overall health of your Vault cluster. Refer to the Vault Cluster Monitoring guide for more detail.

» Key Metrics

  • logshipper.streamWALs.missing_guard: Number of incidences where the starting Merkle Tree index used to begin streaming WAL entries is not matched/found
  • logshipper.streamWALs.guard_found: Number of incidences where the starting Merkle Tree index used to begin streaming WAL entries is matched/found
  • replication.fetchRemoteKeys: Time taken to fetch keys from a remote cluster participating in replication prior to Merkle Tree based delta generation
  • replication.merkleDiff: Time taken to perform a Merkle Tree based delta generation between the clusters participating in replication
  • replication.merkleSync: Time taken to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication
  • vault.wal_persistwals: Time taken to persist a WAL to storage
  • vault.wal_flushready: Time taken to flush a ready WAL to storage
  • wal.gc.total: Total number of WALs on disk
  • wal.gc.deleted: Number of WALs deleted during each garbage collection run

WALs are purged every few seconds by a garbage collector, but if Vault is under heavy load, WALs may start to accumulate, putting pressure on the storage backend (Consul). To detect back pressure from a slow storage backend, monitor the vault.wal_flushready and vault.wal_persistwals metrics.

» Vault Configuration Consideration

It is recommended to explicitly set Vault's HA parameters in your Vault configuration file. Often, it is not necessary to configure api_addr and cluster_addr when using Consul as Vault's storage backend, since Consul will attempt to automatically discover and advertise the address of the active Vault node. However, when you have a load balancer in front of Vault, you must set these parameter values explicitly (and you typically have a load balancer in larger deployments).

Example:

listener "tcp" {
  address          = "0.0.0.0:8200"
  cluster_address  = "10.1.42.201:8201"
  tls_disable      = "true"
}

storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

telemetry {
  dogstatsd_addr = "localhost:8125"
  disable_hostname = true
}

api_addr = "http://10.1.42.201:8200"
cluster_addr = "https://10.1.42.201:8201"

  • api_addr: Specifies the full URL to advertise to other Vault servers in the cluster for client redirection.
  • cluster_addr: Specifies the full URL to advertise to other Vault servers in the cluster for request forwarding.

» Update Replication Primary

Consider a scenario where the performance primary (cluster 1) has two performance secondaries (clusters 2 and 5).

When an unexpected event causes cluster 1 to become inoperative, you would need to promote one of the performance secondaries to be the new primary. At the same time, cluster 3 needs to be promoted to be DR primary.

[Figure: Replication Groups]

After one of the performance secondaries has been promoted, you still need to update the other performance secondary to point to the new primary.
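
A minimal sketch of that flow, assuming cluster 2 is the secondary being promoted (the secondary ID below is an example): promote cluster 2, then issue a new activation token on it for the remaining performance secondary. The update-primary call itself is shown in the Hints & Tips section below.

# On cluster 2: promote the performance secondary to be the new primary
$ vault write -f sys/replication/performance/secondary/promote

# On cluster 2 (now the primary): generate an activation token for the other secondary
$ vault write sys/replication/performance/primary/secondary-token id="pr_secondary_2"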

» Hints & Tips

When you update the replication primary:

  • Use the primary_api_addr parameter to specify the primary's API address if you receive an "error unwrapping secondary token" error from the update-primary operation (see the example after this list).

  • Ensure that your activation token has not expired. If the token has expired, you will receive a "Bad Request" error.
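
As an illustration only (the token and address below are placeholders), an update-primary call that supplies primary_api_addr explicitly might look like this:

$ vault write sys/replication/performance/secondary/update-primary \
    token="<new_activation_token>" \
    primary_api_addr="https://new-primary.example.com:8200"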

Some other scenarios where you may need to use the update-primary endpoint are:

  • If you have network issues between the Primary's active and standby nodes, the secondary may have an incomplete list of known_primary_cluster_addrs. If this list of known addresses is incomplete at the time of a Primary leadership change and does not include the new Primary's active node address, the secondary will start seeing errors indicating that the connection is refused. In this scenario, you need to use update-primary to inform the secondary of its new primary. If you run into this situation, it is also important to investigate the network problems on the Primary to make sure it is able to collect all of the standby nodes' cluster information to give to the secondary.

» Help and Reference