When I started my new job last year, my first task was to redesign the infrastructure and move from EC2 Classic to EC2 VPC. I spent the first few weeks setting up a new VPC with different subnets for each concern, a bastion server to access the servers located within the network, and a Consul cluster to keep track of the running instances.
In the past, I used several ways to manage the accounts and keys to access my servers:
- a central LDAP server, but no jump host;
- a jump host which was using SSH Agent forwarding to access the other servers, but only a handful of accounts/keys managed manually;
- a jump host, and all the accounts created on each server by Ansible, but using the SSH public keys exposed by GitHub to authenticate the user.
After some research, it seemed there was no clear winner for implementing a bastion server. At the time, the options were:
- Netflix BLESS, which seemed overkill;
- Hashicorp Vault SSH Secret Backend, which only helps with setting up an SSH CA but doesn’t manage the accounts;
- the good old LDAP server.
I chose to stick with a central LDAP server: a bastion server configured to only allow jumps, with all the servers connecting to the LDAP server to fetch the user info. The SSH keys would be stored on S3 and fetched when required by the SSH daemon.
Note: Since then, Gravitational Teleport has become quite popular and might be a good alternative.
The issue with that first approach was the latency when contacting the LDAP server to retrieve the info, and the fact that if the LDAP server was unreachable, it was no longer possible to SSH in. That’s when I found Google’s nsscache, which synchronizes a local NSS cache (containing Unix users and groups) from a remote directory service. I set up a cron job on all the servers to create a local cache of what was stored in LDAP.
While I initially set up nsscache to use LDAP, it actually supports several other sources, one of them being Consul. Since I already had Consul set up, that seemed the way to go. I had to submit a PR to add support for caching /etc/shadow, in addition to the existing caches for /etc/passwd and /etc/group. After that, I was able to drop LDAP and use nsscache with Consul for SSH authentication.
I will now guide you through the different steps required to set up nsscache with Consul.
Each server must have nsscache installed:
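On a Debian-based system, for example, it might look like this (package names assumed; nsscache can also be installed from source):

```shell
# Install nsscache itself plus the NSS module that reads its cache files.
# Package names are for Debian/Ubuntu; adjust for your distribution.
sudo apt-get update
sudo apt-get install -y nsscache libnss-cache
```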
The /etc/nsscache.conf file must be configured to read from Consul.
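A minimal configuration could look like the sketch below. The Consul-specific option names are assumptions here; check the documentation of your nsscache version for the exact ones it supports:

```ini
[DEFAULT]
# Pull account data from Consul and write it to local cache files.
source = consul
cache = files
maps = passwd, group, shadow
timestamp_dir = /var/lib/nsscache

# Assumed option name: where the local Consul agent listens.
consul_uri = http://localhost:8500
```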
You must then add the groups and accounts to Consul. Here is the tree of the data for a user foo in a group:
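As an illustrative sketch, the data could be written with the Consul KV CLI. The group name foo and the exact key names are assumptions; check the nsscache Consul source for the attribute names it actually reads:

```shell
# Group entry, mapping to the fields of /etc/group.
consul kv put group/foo/gid 1000

# Passwd entry, mapping to the fields of /etc/passwd.
consul kv put passwd/foo/uid 1000
consul kv put passwd/foo/gid 1000
consul kv put passwd/foo/comment "Foo Bar"
consul kv put passwd/foo/home /home/foo
consul kv put passwd/foo/shell /bin/bash

# Shadow entry, mapping to the fields of /etc/shadow.
consul kv put shadow/foo/lastchange 17000
```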
The group cache accepts the following attributes, which map to the fields present in /etc/group.
The passwd cache accepts the following attributes, which map to the fields present in /etc/passwd.
The shadow cache accepts the following attributes, which map to the fields present in /etc/shadow.
With nsscache configured and your data in Consul, you should be able to generate the cache files with the following command:
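A full synchronization with nsscache’s update command looks like this:

```shell
# Fetch all configured maps from the source and rebuild the local
# cache files from scratch (instead of an incremental update).
sudo nsscache update --full
```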
To make your system use nsscache, you need to configure /etc/nsswitch.conf as follows:
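With libnss-cache installed, the relevant lines would look like this (the module name cache comes from libnss-cache):

```
passwd: files cache
group:  files cache
shadow: files cache
```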
You should also configure PAM to create the home directory of the user upon login, if it doesn’t exist. It can be done by adding the following line to your PAM session configuration:
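Assuming a Debian-style PAM layout, the line below would go in /etc/pam.d/common-session; pam_mkhomedir is the standard module for this:

```
session required pam_mkhomedir.so skel=/etc/skel umask=0022
```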
The last missing step is the management of the SSH keys. The following lines need to be added to /etc/ssh/sshd_config so that OpenSSH fetches the keys when a user tries to log in:
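One way to do this is with sshd’s AuthorizedKeysCommand directive, pointing at a small helper script (the script path and name here are assumptions):

```
AuthorizedKeysCommand /usr/local/bin/fetch-ssh-keys %u
AuthorizedKeysCommandUser nobody
```

The hypothetical helper would print the user’s public key fetched from the S3 bucket used in this guide:

```shell
#!/bin/sh
# /usr/local/bin/fetch-ssh-keys (hypothetical helper): print the public
# key of the user whose name is passed as the first argument.
# Requires the AWS CLI and credentials available to the command user.
exec aws s3 cp "s3://YOUR_S3_BUCKET/keys/$1" -
```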
After uploading your SSH key to
s3://YOUR_S3_BUCKET/keys/foo, you should be able to SSH into your server.
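Uploading the key can be done with the AWS CLI, for example:

```shell
# Upload the public key; the object name (foo) must match the
# unix account name, since sshd looks keys up by username.
aws s3 cp ~/.ssh/id_rsa.pub s3://YOUR_S3_BUCKET/keys/foo
```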
Throughout this guide, you learned how to use Consul to store the accounts used to SSH into your servers. One of the advantages of Consul over LDAP is that setting up a cluster with replicated data is easy, so you no longer have a single point of failure.
I’ve written a quick demo that you can download and run on your computer.
There are still some improvements possible:
- enabling Consul ACLs to prevent other services from reading or editing your account data;
- using Vault instead of Consul to store user accounts and passwords.