Incident Report: Cascading failure on KeyDB server
An issue with KeyDB's append-only file (AOF) log caused a brief outage across our infrastructure.
At the time of publishing, all services should be working as intended. We apologize for any inconvenience.
We found that our services were failing because the AOF log was never truncated, which left KeyDB stuck in a crash loop. This was compounded by unfortunate timing: our automated maintenance system shut down the entire cluster while the issue was still unresolved.
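To reduce the chance of the AOF growing unbounded again, KeyDB (like Redis) supports automatic background rewriting of the AOF once it passes a size threshold. The snippet below is a minimal sketch, assuming KeyDB reads its configuration from a Kubernetes ConfigMap; the ConfigMap name and the thresholds shown are illustrative, not our production values.

    # Sketch of a ConfigMap carrying the KeyDB configuration.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: keydb-config        # hypothetical name, for illustration only
    data:
      keydb.conf: |
        appendonly yes
        # Rewrite (compact) the AOF in the background once it has
        # doubled in size since the last rewrite...
        auto-aof-rewrite-percentage 100
        # ...but not for files smaller than 256 MB.
        auto-aof-rewrite-min-size 256mb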
The crash loop appears to have been triggered by KeyDB's restart mechanism trying to load the untruncated AOF log, which had grown to 8.9 GB. That was more than the server could handle within the startup window: the pod timed out before it finished starting, Kubernetes put KeyDB into a crash loop, and the failed health checks also prevented redundant fail-over nodes from being spawned.
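One common mitigation for slow AOF replay is to give the pod a dedicated startup window so that ordinary liveness checks cannot restart it while it is still loading data. The excerpt below is a sketch of what that could look like in a container spec; the image, probe type, port, and thresholds are assumptions for illustration, not a description of our actual manifests.

    # Sketch of a container spec with a generous startup window.
    containers:
      - name: keydb
        image: eqalpha/keydb          # assumed image, for illustration
        ports:
          - containerPort: 6379
        # The startupProbe runs first; other probes only begin once it
        # succeeds, so a long AOF replay does not trigger restarts.
        startupProbe:
          tcpSocket:
            port: 6379
          periodSeconds: 20
          failureThreshold: 30        # up to 30 * 20s = 10 minutes to start
        livenessProbe:
          tcpSocket:
            port: 6379
          periodSeconds: 10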
To make matters worse, after the cluster was rebooted, the K3s nodes' networking began dropping out, causing containers to fail to start due to image pull failures; this required manual intervention and a second cluster-wide reboot.
Fortunately, KeyDB's snapshot dump feature worked as intended and saved the database state moments before the failure, so nothing of value was lost. Even if data had been lost, the impact would have been minimal, as we only use KeyDB as a temporary cache for miscellaneous token cookies. You may need to log back in to your services as a result.
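For context, the snapshots in question are the standard point-in-time dumps that KeyDB (like Redis) writes to disk on a schedule. A minimal sketch of that part of the configuration is below, continuing the hypothetical keydb.conf from earlier; the intervals shown are the stock defaults, not necessarily the ones we run.

    # Sketch of point-in-time snapshotting in keydb.conf.
    # Write a dump if at least 1 key changed in 900s, 10 keys in 300s,
    # or 10000 keys in 60s (the stock defaults).
    save 900 1
    save 300 10
    save 60 10000
    # Directory the dump file is written to (assumed persistent volume).
    dir /data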
Services Affected: Internal and Customer-facing Services