Disaster recovery for federated primary datacenter

14min
|
Consul

Operating a multi-datacenter federated environment requires you to prepare for many possibilities, including a complete outage of one of your physical datacenters, or a cloud provider outage that might make one of the components of your environment temporarily or permanently unavailable.

This can have different implications depending on your configuration, but it is something you want to be prepared for, in case the unthinkable happens.

This tutorial guides you through the process of creating a recovery strategy from a quorum loss to the complete loss of your primary datacenter.

The recommended logical steps are listed in three main sections, the first will help you prepare for an outage event, the second will help you restore your primary datacenter, and the third will help you validate the restore was successful and ensure federation is re-established.

Prerequisites

Base configuration

To complete this tutorial, you will need two wide area network (WAN) joined Consul datacenters with access control list (ACL) replication enabled. If you are starting from scratch, follow these tutorials to set up your datacenters, or use them to check that you have the proper configuration in place:

You will also need to enable ACL replication, which you can do by following the steps in the ACL Replication for Multiple Datacenters tutorial.

Mesh gateways

If you want to make use of the mesh gateway functionality, you can follow Connect Services Across Datacenters with Mesh Gateways to setup them for the two datacenters.

Central service configuration

Lastly, you should set enable_central_service_config = true on your Consul clients which will allow you to centrally configure the sidecar and mesh gateway proxies.

Disaster recovery process

The process can be summarized in the following phases:

Formulate a disaster preparation strategy
Outage event occurs for the primary datacenter
Restore primary datacenter
Restore and validate federation

Disaster preparation strategy

Use a recommended architecture

Not every outage has the same level of impact. A lot of the resiliency of your Consul federation will rely on proper configuration and the adoption of a recommended architecture.

Refer to Consul Reference Architecture to learn the recommended configurations for your Consul datacenter.

Automate your deployment

Re-deploying an entire datacenter after an outage is a non-trivial operation that might require a considerable amount of time. You can reduce the time to recover by following Infrastructure as Code (IaC) principles and using tools such as Terraform and Vault to help you in the deployment and recovery process.

The amount of downtime you experience from the loss of your primary datacenter is directly proportional to the amount of time it takes you to deploy a new primary datacenter. Reducing the deployment time via automation will not only help you standardize the process and reduce errors, but also reduce the downtime you will experience after a datacenter loss.

Have a backup strategy

A Consul datacenter's state is more than just the initial configuration, it includes data that is generated during normal operations such as KV entries, ACL tokens, and intentions. When your datacenter fails, this information is lost and cannot be manually recreated without a backup. For example you may keep you Consul agent and service configuration in version control, however, you will not be able to as easily recover security credentials since credentials generated during the installation process are now different from the ones that were present in the datacenter that was “lost” in the disaster.

The process of then updating every secret in Consul to reflect the updated value is tedious, error prone and hard to debug.

Restoring from a snapshot ensures all the intentions, KV entries and ACL tokens are reintroduced.

You can follow Backup Consul Data and State to learn how to perform a snapshot of your Consul datacenter to use in case of disaster.

Have a TLS certificate distribution process in place

Certificates are stored on the agent disk and not saved in a snapshot. This means you will have to re-generate them.

Consul comes equipped with a command ,consul tls, that permits you to generate TLS certificates for the agents. This simplifies the automation of deployment by giving you the ability to generate a CA and TLS certificates as part of the process.

As an alternative, we suggest you use Vault as a CA and TLS certificate generator to help you automate the process. You can follow Generate mTLS Certificates for Consul with Vault to learn how to automate certificate generation and distribution for your Consul server agents.

Adopt an adequate ACL down policy

When your primary datacenter is down you lose your ability to validate ACL policies. To mitigate this Consul has a configuration parameter, down_policy, that tells Consul which strategy to follow if ACLs cannot be validated against the primary datacenter.

By default, Consul adopts the extend-cache approach, meaning that in case of an outage Consul will allow cached ACL objects to be used, ignoring their TTL values. If a non-cached ACL is used, extend-cache acts like deny.

If you changed the down_policy to the more restrictive value of deny, you will be impacted more severely from the outage, since all ACL protected operations in the secondary datacenter will be denied until the primary datacenter is restored.

Outage event in the primary datacenter

Depending on your configuration, multiple levels of outage can occur.

This tutorial covers how to handle outages that render the primary datacenter unable to serve requests, including service discovery requests to Consul DNS and intention validation. This can happen in many possible ways, but the two extremes of the spectrum are the following:

Loss of quorum in the primary datacenter This is an outage where the datacenter has less than (N/2)+1 servers available, where N is the total number of servers. While some of the nodes are unaffected, not enough are available to form a quorum.
Complete loss of the primary datacenter This is either a disaster event that completely wipes an entire facility or a major outage of your cloud provider.

Loss of quorum in the primary datacenter

In the scenario where you have lost enough servers in your primary datacenter that a quorum cannot be reached, you will experience the same level service failure as if the whole datacenter is unavailable.

If the outage is limited to the server nodes, or did not severely impact your client fleet, it might not be practical to rebuild the whole datacenter only to re-establish a quorum.

In this scenario, you can follow the Outage Recovery tutorial to restore your server nodes and make sure they are able to reform the raft cluster and elect a leader.

Once the raft cluster is reformed, and the datacenter is again operational, you can now restart any client agents that might have been affected by the outage.

Complete loss of the primary datacenter

The worst case scenario in a federated environment is the loss of the primary datacenter.

In this scenario, the course of action should aim to restore the lost datacenter as quickly as possible. The following sections in the tutorial will guide you through the necessary steps to perform a restore and to make sure all functionality is re-established.

Restore primary datacenter

The steps necessary to restore your environment after a full outage of your primary datacenter are the following:

Restore your primary datacenter
Restore the last snapshot to the newly recovered datacenter
Apply agent tokens to the servers
Perform a rolling restart of the servers in the primary datacenter
Perform a rolling restart of the clients in the primary datacenter
Restore and validate federation

Restore datacenter nodes

You can use the same process you used for the initial deploy of your Consul datacenter.

To make sure the new deployment is able to communicate with the secondary datacenter, you will need to ensure that you are using the same

CA certificate
CA key
Gossip encryption key

as the ones you used for the pre-existing deployment.

Restore snapshot

After the datacenter has been restored, you can restore the snapshot to it using the consul snapshot command.

$ consul snapshot restore -token=<value> backup.snap

Restored snapshot

Note

The token used for the snapshot restore procedure needs to be a token valid for the newly restored datacenter. If your restored datacenter does not have ACL enabled, you can restore the snapshot without a token.

Set Consul ACL agent tokens

The newly restarted nodes will now be missing the ACL tokens needed to successfully join the datacenter. Once you restored the ACL system with the snapshot restore you will be able to set the tokens for the different nodes using the consul acl command.

Note

The following operations require a token with acl = write privileges. Also, after the snapshot is restored the ACL system will now contain the previous tokens created before the outage. You can use the any management token you had in your datacenter before the outage. To setup the token for the request you can either use the CONSUL_HTTP_TOKEN environment variable or to pass it directly using the -token= command parameter.

First retrieve the tokens available in the datacenter.

$ consul acl token list

...

AccessorID:       694f15e2-b8f9-c5dd-cb92-8d7bc529df9f
Description:      server-1 agent token
Local:            false
Create Time:      2021-01-07 21:38:25.331288761 +0000 UTC
Legacy:           false
Policies:
   a660e45f-3f6e-501c-1303-3d39a83f6ff9 - acl-policy-server-node

...

Then retrieve the token using the AccessorID.

$ consul acl token read -id 694f15e2-b8f9-c5dd-cb92-8d7bc529df9f

AccessorID:       694f15e2-b8f9-c5dd-cb92-8d7bc529df9f
SecretID:         237f1a27-3399-ebf0-29f1-9828d89159b1
Description:      server-1 agent token
Local:            false
Create Time:      2021-01-07 21:38:25.331288761 +0000 UTC
Policies:
   a660e45f-3f6e-501c-1303-3d39a83f6ff9 - acl-policy-server-node

Note

You will have to set the token on all the server nodes in order to get them able to re-join the datacenter successfully. Depending on your configuration you might have different tokens for each server or re-use the same token for all server agents but in both cases the token needs to be set on all agents otherwise they will not be able to successfully join the datacenter.

Finally apply the token to the server.

$ consul acl set-agent-token agent 237f1a27-3399-ebf0-29f1-9828d89159b1

ACL token "agent" set successfully

Perform a rolling restart of the servers

After the snapshot is restored and the tokens have been set on the nodes, you will observe errors in the server logs about duplicate node ids. This happens because the servers received new node ids when they were reinstalled, and these node ids are different from the ones stored in the snapshot.

...
[WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "835aa73d-78e9-ba63-d2da-8bbfba1329a7": Node name server-1 is reserved by node 4d2d0bfa-6d2d-c373-fff2-16ac612dca7a with name server-1 (172.19.0.3)"
[WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "49be2708-2f0b-0a1b-0069-9e1420fa251b": Node name server-2 is reserved by node ea086564-8d58-2918-682d-9092651f2157 with name server-2 (172.19.0.4)"
[WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "3398cdca-dc18-a786-e3a6-6f7deb5dcd0d": Node name server-3 is reserved by node d7529d1e-6924-41d8-fc40-fa0f131cc558 with name server-3 (172.19.0.5)"
...

To resolve these errors, you need to perform a consul leave on each server and then start the server again.

Once servers are restarted, the node ids will be set to the expected value and this will resolve the errors in the logs.

Perform a rolling restart of the clients

The same log errors will be present on the clients. After completing the server restarts, you must perform the same operations on the clients to resolve the log errors.

Restore and validate federation

Once restored, it is possible that the new servers will have a different IP than the one used to wan join them from the secondary datacenter. In that scenario, the federation will be broken.

To verify the federation you can use consul members on the primary datacenter.

$ consul members -wan
Node                Address          Status  Type    Build  Protocol  DC         Segment
server-1.primary    172.19.0.3:8302  alive   server  1.9.0  2         primary    <all>
server-2.primary    172.19.0.4:8302  alive   server  1.9.0  2         primary    <all>
server-3.primary    172.19.0.5:8302  alive   server  1.9.0  2         primary    <all>

To restore the federation you must perform a rolling restart of the secondary datacenter servers using the new primary datacenter servers' IP in the retry-join-wan parameter.

After the rolling restart, you should be able to observe all the servers in the consul members command output.

$ consul members -wan
Node                Address          Status  Type    Build  Protocol  DC         Segment
server-1.secondary  172.19.0.7:8302  alive   server  1.9.0  2         secondary  <all>
server-2.secondary  172.19.0.8:8302  alive   server  1.9.0  2         secondary  <all>
server-3.secondary  172.19.0.9:8302  alive   server  1.9.0  2         secondary  <all>
server-1.primary    172.19.0.3:8302  alive   server  1.9.0  2         primary    <all>
server-2.primary    172.19.0.4:8302  alive   server  1.9.0  2         primary    <all>
server-3.primary    172.19.0.5:8302  alive   server  1.9.0  2         primary    <all>

Next steps

In this tutorial, you learned how to restore your federated environment in the event of a full outage of your primary datacenter.

You can follow Provide Fault Tolerance with Redundancy Zones to learn how Consul Enterprise helps you in making the deployment more robust by using different zones (such as AWS Availability Zones) for your Consul deployment.

Disaster recovery

Troubleshoot Consul with hcdiag