
Maintenance and Monitoring Operations

Add & Remove Consul Servers

Consul is designed to require minimal operator involvement; however, any changes to the set of Consul servers must be handled carefully. To better understand why, read about the consensus protocol. In short, the Consul servers perform leader election and replication. For changes to be processed, a quorum of (N/2)+1 servers must be available. That means if there are 3 server nodes, at least 2 must be available.
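The quorum arithmetic can be sketched directly in shell; this is plain integer math, not a Consul command:

```shell
# Quorum for N servers is (N/2)+1, using integer division.
for n in 1 3 5 7; do
  echo "$n servers -> quorum of $(( n / 2 + 1 ))"
done
```

With 3 servers, quorum is 2, matching the example above; with 5 servers, it is 3.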

In general, if you are ever adding and removing nodes simultaneously, it is better to first add the new nodes and then remove the old nodes.

In this tutorial, you will learn the different methods for adding and removing servers.

»Manually add a new server

Manually adding new servers is generally straightforward: start the new agent with the -server flag. At this point, the server will not be a member of any datacenter and should emit something like:

$ consul agent -server
[WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.

This means that it does not know about any peers and is not configured to elect itself. This is expected, and you can now add this node to the existing datacenter using join. From the new server, you can join any member of the existing datacenter:

$ consul join <Existing Node Address>
Successfully joined cluster by contacting 1 nodes.

It is important to note that any node, including a non-server, may be specified for join. Generally, this method is good for testing purposes but not recommended for production deployments. For production datacenters, you will likely want to use the agent configuration option to add additional servers.

»Add a server with agent configuration

In production environments, you should use the agent configuration option, retry_join. retry_join can be used as a command-line flag or in the agent configuration file.

With the Consul CLI:

$ consul agent -retry-join=<server1-address> -retry-join=<server2-address> -retry-join=<server3-address>

In the agent configuration file:

{
  "bootstrap": false,
  "bootstrap_expect": 3,
  "server": true,
  "retry_join": ["<server1-address>", "<server2-address>", "<server3-address>"]
}

retry_join ensures that if a server loses its connection with the datacenter for any reason, including a node restart, it can rejoin when it comes back. In addition to static IP addresses, it also works with other discovery mechanisms, such as auto-joining based on cloud metadata. Both servers and clients can use this method.
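As an illustration, a retry_join entry can carry a cloud auto-join specifier instead of a static address; the tag key and value below are placeholders you would replace with your own instance tags:

```json
{
  "retry_join": ["provider=aws tag_key=consul-role tag_value=server"]
}
```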

»Server coordination

To ensure Consul servers are joining the datacenter properly, you should monitor the server coordination. The gossip protocol is used to properly discover all the nodes in the datacenter. Once the node has joined, the existing datacenter leader should log something like:

[INFO] raft: Added peer, starting replication

This means that raft, the underlying consensus protocol, has added the peer and begun replicating state. Since the existing datacenter may be very far ahead, it can take some time for the new node to catch up. To check on this, run info on the leader:

$ consul info
raft:
    applied_index = 47244
    commit_index = 47244
    fsm_pending = 0
    last_log_index = 47244
    last_log_term = 21
    last_snapshot_index = 40966
    last_snapshot_term = 20
    num_peers = 4
    state = Leader
    term = 21

This command provides various information about the state of Raft. In particular, last_log_index shows the last log entry that is on disk. Run the same info command on the new server to verify how far behind it is. Eventually, the new server will catch up, and the values should match.
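As a sketch with sample values (not live Consul output), the new server has caught up when its last_log_index reaches the leader's:

```shell
# Sample index values standing in for `consul info` output on each node.
leader_last_log_index=47244
new_server_last_log_index=47198

if [ "$new_server_last_log_index" -ge "$leader_last_log_index" ]; then
  echo "caught up"
else
  echo "behind by $(( leader_last_log_index - new_server_last_log_index )) entries"
fi
```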

It is best to add servers one at a time, allowing them to catch up. This avoids the possibility of data loss in case the existing servers fail while bringing the new servers up-to-date.

»Manually remove a server

Removing servers must be done carefully to avoid causing an availability outage. For a datacenter of N servers, at least (N/2)+1 must be available for the datacenter to function. Refer to the deployment table in the Consul documentation. If you have 3 servers and 1 of them is currently failing, removing any other server will cause the datacenter to become unavailable.

To avoid this, it may be necessary to first add new servers to the datacenter, increasing the failure tolerance of the datacenter, and then to remove old servers. Even if all 3 nodes are functioning, removing one leaves the datacenter in a state that cannot tolerate the failure of any node.
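The failure-tolerance arithmetic can be sketched the same way; again, plain shell math rather than a Consul command:

```shell
# Failure tolerance for N servers is (N-1)/2, using integer division.
for n in 3 2 5; do
  echo "$n servers tolerate $(( (n - 1) / 2 )) failure(s)"
done
```

Dropping from 3 servers to 2 takes the tolerance from 1 failure to 0, which is why you add replacements before removing old servers.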

Once you have verified the existing servers are healthy and the datacenter can handle a node leaving, the process is straightforward: issue a leave command to the server.

$ consul leave

The leaving server should log something like:

[INFO] consul: server starting leave
[INFO] raft: Removed ourself, transitioning to follower

The leader should also emit various logs including:

[INFO] consul: member 'node-10-0-1-8' left, deregistering
[INFO] raft: Removed peer, stopping replication

At this point the node has been gracefully removed from the datacenter, and will shut down.

To remove all agents that accidentally joined the wrong set of servers, clear out the contents of the data directory (-data-dir) on both client and server nodes.
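A minimal sketch of clearing a data directory, using a temporary directory to stand in for the agent's actual -data-dir path:

```shell
# Stand-in for the agent's -data-dir; substitute your real path.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/raft" "$DATA_DIR/serf"   # simulate leftover cluster state

# Stop the agent first, then clear the directory's contents.
rm -rf "${DATA_DIR:?}"/*

remaining=$(ls -A "$DATA_DIR" | wc -l)
echo "remaining entries: $remaining"
```

The `${DATA_DIR:?}` expansion aborts if the variable is unset, guarding against an accidental `rm -rf /*`.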

These graceful methods to remove servers assume you have a healthy datacenter. If the datacenter has no leader due to loss of quorum or data corruption, you should plan for outage recovery.

»Manual forced removal

In some cases, it may not be possible to gracefully remove a server. For example, if the server simply fails, there is no way to issue a leave command. Instead, the datacenter will detect the failure and replication will continuously retry.

If the server can be recovered, it is best to bring it back online and then gracefully leave the datacenter. However, if this is not a possibility, then the force-leave command can be used to force removal of a server.

$ consul force-leave <node>

Invoke the command with the name of the failed node. The datacenter leader will then mark the node as having left the datacenter and stop attempting to replicate to it.

»Next steps

In this tutorial, you learned the process of adding and removing servers: manually adding servers, adding servers through the agent configuration, gracefully removing servers, and forcing removal of servers. Remember that while manually adding servers is good for testing purposes, in production it is recommended to add servers with the agent configuration.