Over the Christmas holiday, we were alerted to and fixed an API issue that briefly prevented customers from making payments and creating new accounts. We invite you behind the scenes to learn what happened.
Overview of what happened
On 29 December 2020 we were alerted to our API not responding as it should. Initial inspection gave us the impression that our Keepalived service (used for sharing a floating IP to a healthy node) was not functioning correctly as a result of an incorrectly renewed Let’s Encrypt certificate.
Over the course of five hours, we investigated the issue and deployed a fix that forces our app to use the correct intermediate certificate.
We verified after the fact that hard-coding our intermediate certificate (done in our Certbot Dockerfile over the summer of 2020) caused the problem.
How this affected customers
During this time, customers were unable to create new accounts or make payments to existing ones.
We mistakenly assumed that Let’s Encrypt would not change their default intermediate certificate (the one we were using) before the X3 version was to be invalidated on 17 March 2021, as shown on Let’s Encrypt’s website.
Historical certificates show that the X2 version expired on 20 October 2020. Looking at X2 helped us confirm that Certbot starts using newer intermediate certificates before the old ones expire.
What caused our issue was that the R3 intermediate certificate started being used before the X3 intermediate certificate (the one our Certbot had opted to use on our servers) became invalid.
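To make the mismatch concrete, here is a minimal Python sketch of the kind of chain check that would flag it. It is an illustration under assumptions, not the code we run: the `Cert` type, the `first_chain_break` function, and the hostname `api.example.com` are hypothetical, and each certificate is reduced to a simplified subject and issuer name rather than a parsed X.509 certificate.

```python
from typing import List, NamedTuple, Optional


class Cert(NamedTuple):
    """Simplified stand-in for a certificate: just subject and issuer names."""
    subject: str
    issuer: str


def first_chain_break(chain: List[Cert]) -> Optional[int]:
    """Return the index of the first certificate whose issuer is not the
    subject of the next certificate in the served chain, or None if every
    link matches."""
    for i in range(len(chain) - 1):
        if chain[i].issuer != chain[i + 1].subject:
            return i
    return None


# Hypothetical chain resembling the incident: the renewed leaf is issued by
# R3, but the hard-coded intermediate being served is still X3.
broken_chain = [
    Cert(subject="api.example.com", issuer="R3"),
    Cert(subject="Let's Encrypt Authority X3", issuer="DST Root CA X3"),
]

# The same leaf served together with the intermediate that issued it.
fixed_chain = [
    Cert(subject="api.example.com", issuer="R3"),
    Cert(subject="R3", issuer="DST Root CA X3"),
]

print(first_chain_break(broken_chain))
print(first_chain_break(fixed_chain))
```

For the broken chain the function points at the leaf certificate (index 0), because its issuer does not match the intermediate served after it. A real check would compare full distinguished names and verify signatures, which this sketch leaves out.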
Our long-term solution
We have prepared a long-term solution that will prevent the issue from happening again. It will be deployed before the end of this month and will coincide with the renewal of our Let’s Encrypt certificates across all of our web-facing applications.
Detailed timeline of events
All times are local time, Sweden.
- 2020-12-29 13:15 The Infrastructure and Services teams are alerted to a somewhat faulty API.
- 2020-12-29 13:50 Investigation begins with members of both teams after they have all gathered online.
- 2020-12-29 14:15 Our initial investigation is completed, and we start deploying potential fixes to what we think is the issue, relating to Keepalived.
- 2020-12-29 14:30 We add static networking settings as a further tweak to mitigate the Keepalived issue. This is a temporary workaround.
- 2020-12-29 14:45 Continuing the investigation, we read through our Nginx configuration and Nginx error logs (these show only server-side errors; as you would expect, they contain no user data). Research is ongoing, with other team members helping where appropriate, and some tweaks are prepared for deployment.
- 2020-12-29 15:00 We deploy some tweaks to our Nginx configuration, then monitor the situation and keep investigating.
- 2020-12-29 15:30 Discussions and theories about our Let’s Encrypt certificate being the issue start to come to light, and there are more direct workarounds in progress.
- 2020-12-29 16:00 The Let’s Encrypt theory seems to be correct, and work begins on a fix to be deployed.
- 2020-12-29 16:30 The fix is deployed to the main server in our API cluster and is confirmed to be working. It is then rolled out to the other servers in the cluster.
- 2020-12-29 17:45 After a postmortem and continued discussions across teams, the issue is marked as resolved.