Distributed locking with Consul

A few months ago, I had to set up a daily job to back up a database running in a cluster. The job had to run on one of the nodes of the cluster, but configuring a cron job on only one server would mean that if that node is unavailable, the backup won't be performed.

I decided to configure the cron job on all the servers, and to set up some kind of distributed locking to ensure that only one of the nodes will actually perform the backup every day.

Because I already had Consul running in my infrastructure, I first looked at the Consul Semaphore system. Since the semaphore system is a bit more complex than what I wanted to achieve, I used the Leader election algorithm instead. The main difference is that with leader election only one session can hold the lock, while the semaphore algorithm allows several sessions to hold a lock at the same time.

To acquire a lock, a session needs to be created first. This can be achieved with the following HTTP request:

curl -XPUT \
"${consul_address}/v1/session/create" \
-d "{\"Name\": \"${task_name}\"}"

Then, the lock can be acquired using the following HTTP request:

curl -XPUT \
"${consul_address}/v1/kv/locks/${task_name}/.lock?acquire=${session_id}"

If the lock has been acquired, the response body will be true.

If the cron job is scheduled at the same time on all the nodes, the first node to acquire the lock performs the task, while the script on the other nodes exits immediately.
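
For example, a crontab entry like the following could be installed on every node; the schedule and the script path are just placeholders, adjust them to your setup:

# Run the backup script at 03:00 every day, on every node of the cluster.
0 3 * * * /usr/local/bin/backup-with-consul-lock.sh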

When the task is completed, the lock needs to be released with:

curl -XPUT \
"${consul_address}/v1/kv/locks/${task_name}/.lock?release=${session_id}"

The session also needs to be destroyed before exiting the script.

curl -XPUT \
"${consul_address}/v1/session/destroy/${session_id}"

If we put everything together, it looks like this:

#!/usr/bin/env bash

# consul_address and task_name are expected to be set beforehand, for example:
# consul_address="http://127.0.0.1:8500"
# task_name="db-backup"

# Create a session and keep its ID; jq extracts it from the JSON response.
session_id=$(curl -s -XPUT "${consul_address}/v1/session/create" -d "{\"Name\": \"${task_name}\"}" | jq -r '.ID')

acquire_lock() {
  local session_id=${1}
  echo "Trying to acquire the lock..."
  result=$(curl -s -XPUT "${consul_address}/v1/kv/locks/${task_name}/.lock?acquire=${session_id}")
  # Consul answers "true" if the lock was acquired; the test below also sets
  # the function's exit status, which is checked by the caller.
  [ "${result}" == "true" ] && echo "Lock acquired"
}

release_lock() {
  local session_id=${1}
  echo "Releasing the lock..."
  result=$(curl -s -XPUT "${consul_address}/v1/kv/locks/${task_name}/.lock?release=${session_id}")
  [ "${result}" == "true" ] && echo "Lock released"
}

destroy_session() {
  local session_id=${1}
  echo "Destroying the session..."
  result=$(curl -s -XPUT "${consul_address}/v1/session/destroy/${session_id}")
  [ "${result}" == "true" ] && echo "Session destroyed"
}

if ! acquire_lock "${session_id}"; then
  destroy_session "${session_id}"
  echo "Unable to acquire the lock."
  echo "The job is probably already running on another server."
  exit 0
fi

do_stuff

release_lock "${session_id}"
destroy_session "${session_id}"

A better way to destroy the session would be to use a trap, as explained in this blog post by David Pashley, so that the session is destroyed even if the script fails partway through.
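
A minimal sketch of that approach, reusing the destroy_session function and the session_id variable from the script above:

# Make sure the session is destroyed when the script exits, whatever the reason.
cleanup() {
  destroy_session "${session_id}"
}
trap cleanup EXIT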

I chose to explain how distributed locking works with Consul using bash and curl, but since everything goes through the HTTP API, you can use any programming language that can make HTTP requests, or any Consul client library.

After reading this blog post, you should be able to run tasks across your cluster and rely on Consul for distributed locking when necessary.