Post mortem - WireGuard server connectivity issues
What happened
On the evening of April 27, we suffered a partial WireGuard® outage between approximately 19:00 and 23:30 Swedish time. Some servers stopped forwarding traffic due to a NAT issue on servers running kernel version 4.15.
In total, 132 of 202 WireGuard servers were affected during this time.
Contributing factors
It has been concluded that only servers running kernel version 4.15 were affected. This was true regardless of OS version, on servers running either Ubuntu 16.04 or 18.04.
The exact reason why this kernel version was the culprit, and what triggered the issue during this deployment, are still unknown.
Resolution
The problem was remedied by upgrading or downgrading the kernel on the affected servers to a version unaffected by the issue.
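As an illustration of how servers on the affected kernel series could be picked out of an inventory, here is a minimal sketch. It is not the tooling we used during the incident. The inventory shape (hostname mapped to a kernel release string), the hostnames, and the function names are all hypothetical; the only fact taken from this post mortem is that the 4.15 series was affected.

```python
import re
from typing import Dict, List, Tuple

# Kernel series named in this post mortem as affected.
AFFECTED_SERIES = (4, 15)


def kernel_series(release: str) -> Tuple[int, int]:
    """Return (major, minor) parsed from a release string such as '4.15.0-96-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(release: str) -> bool:
    """True if the release string belongs to the affected kernel series."""
    return kernel_series(release) == AFFECTED_SERIES


def affected_servers(inventory: Dict[str, str]) -> List[str]:
    """Return the sorted hostnames whose kernel is in the affected series.

    `inventory` maps hostname -> kernel release string (hypothetical data shape).
    """
    return sorted(host for host, release in inventory.items() if is_affected(release))
```

On a single Linux machine, the release string for the running kernel can be read with Python's `platform.release()` and passed to `is_affected`.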
Impact
- Time with partially degraded service - 4.5h.
- Affected servers - 132 of 202 WireGuard servers.
Timeline
All times are local time, Sweden.
- 2020-04-27 18:14 - Deployment of extended port ranges and system monitoring changes to our WireGuard servers starts.
- 2020-04-27 19:16 - Sharp increase in customer emails about WireGuard servers being down.
- 2020-04-27 19:30 - Engineers begin troubleshooting the issue.
- 2020-04-27 19:40 - Issue is identified as NAT related; servers are not correctly passing traffic.
- 2020-04-27 20:32 - Issue is traced to kernel version 4.15; the same issue was identified on a few servers on the 24th.
- 2020-04-27 20:35 - A mitigation in the form of replacing the kernel with a non-affected version is verified.
- 2020-04-27 20:42 - Work begins on compiling a list of affected servers.
- 2020-04-27 21:08 - Remediation starts. Servers start coming back online one by one.
- 2020-04-27 23:30 - Remediation of the majority of the servers is complete. Remaining servers are marked as being down.
- 2020-04-28 06:50 - 6 servers that died during the original remediation are brought back online.
Improvements to our deployment procedures
In order to minimize the likelihood of another similar incident occurring, as well as its impact, we are making adjustments to our deployment procedures, including the following:
- Extending the minimum time during which changes are deployed to, and verified working on, a subset of production servers to at least 24 hours before they are deployed to the remaining servers.
- Ensuring that deployments start at 13:00 or earlier (local Swedish time), so that the majority of our engineers are available in case of any deployment issues.
- Improving our existing end-to-end and functional test utilities to verify functionality of servers post-deployment.
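The first two rules above can be expressed as simple checks. The following is a minimal sketch, not our actual deployment tooling: the function names are hypothetical, and timestamps are assumed to be naive datetimes already expressed in local Swedish time.

```python
from datetime import datetime, time, timedelta

LATEST_START = time(13, 0)       # deployments start at 13:00 or earlier
MIN_SOAK = timedelta(hours=24)   # minimum verification time on the subset


def may_start_deployment(now: datetime) -> bool:
    """True if a deployment may begin at `now` (local Swedish time assumed)."""
    return now.time() <= LATEST_START


def may_roll_out_to_remaining(
    subset_deployed_at: datetime, now: datetime, subset_verified: bool
) -> bool:
    """True once the subset is verified working and has run for at least MIN_SOAK."""
    return subset_verified and (now - subset_deployed_at) >= MIN_SOAK
```

Under these checks, a deployment starting at 18:14, as on April 27, would be rejected by `may_start_deployment`.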