Incident Report: Full Networking Failure on April 9, 2023
At around 2:00 ICT, we detected unusual CPU utilization on one of our cluster nodes. We presumed this was a CRI bug related to containerd and runc (an upstream issue), which was throttling all other services and causing them to fail. We decided to do a full reboot of the affected nodes. Assuming that our systems were not under active use, we decided to bring the cluster down for maintenance.
What we did not know, however, was that there were more underlying issues waiting for us after the reboot.
At around 2:30 ICT, we noticed unusual problems in some of our services, indicating that the rebooted nodes had failed to establish a connection with our service mesh. This prevented all pods on those nodes from being allocated an IP address and caused them to fail.
The service failures started rolling in as all services scheduled on those nodes failed to restart, including:
- HashiCorp Consul
- HashiCorp Vault running in high-availability mode, using the Consul backend
- The entire Grafana LGTM Stack, also relying on Consul for its internal service mesh, including Loki, Tempo, and Mimir
- Traefik (partially)
- Terra (partially, with package uploads disabled)
- Madoguchi (metadata microservice for Subatomic)
- Hetzner Cloud CSI Service
- JuiceFS CSI Service
Upon further inspection, we discovered that the pods were failing to allocate their IP addresses because of an issue related to Multus, a meta-CNI we use in our stack along with Calico to add support for additional networking plugins, such as bandwidth limits and additional network metering policies.
Multus is supposed to act as the main CNI interface for all networking, with Calico running on top of it. However, for some reason it failed to contact the Kubernetes API:
2023-04-08 21:56:16.577 [ERROR] plugin.go 580: Final result of CNI DEL was an error. error=error getting ClusterInformation: Get "https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: i/o timeout
Multus was attempting to re-allocate its IPs and virtual network interfaces after we cold-rebooted the nodes, but it failed because it could not contact the Kubernetes API through its own proxy, which had also failed to start.
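The error in the log above is a TCP dial to 10.43.0.1:443 that times out. As a rough illustration of that kind of reachability check, here is a minimal Python sketch. It is not something we ran during the incident, and `can_dial` is a hypothetical helper name, not part of Multus, Calico, or any Kubernetes tooling.

```python
import socket


def can_dial(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

Run from an affected node, `can_dial("10.43.0.1", 443)` would be expected to return `False` while the API address from the log is unreachable. This only tests TCP reachability and says nothing about TLS or API health.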
We assumed that this was a temporary issue that would fix itself later on, but we were completely wrong.
We attempted system upgrades on other nodes, also cold-rebooting them in the hope that this would re-initialize the mesh, but the same network failures started appearing on those nodes too. Eventually, every node was completely isolated from its own network and unable to establish a network connection at all.
Our core services, such as the Terra repositories, still seemed to be up at that moment, so we decided to leave them as-is for further troubleshooting.
At 6:31 ICT, developers started reporting that the Terra repositories had gone down, escalating this into a major incident, as all services were now fully down.
Fortunately, we had recorded this incident in detail earlier, and our team quickly continued working on fixing the issue.
We managed to completely uninstall Multus and Calico and replace them with the Cilium CNI, which worked well enough and is fully supported by RKE2. This fixed the issue by replacing the outdated networking backend with an eBPF-based one.
By 14:30 ICT, we had rolled back most of the cluster settings (except for the CNI), and all services are now back up.
We are sorry for the inconvenience over the last several hours. The root cause of the failure is still unknown.
On the bright side
We have discovered that the Cilium CNI works very well with our setup, even better than Calico.
Initially, we based our choice of CNI backend on this blog post, which shows the following:
While we see the merits of using eBPF for our networking proxy instead of iptables, we initially decided against Cilium due to its resource footprint in this chart. However, in our real-world use case, Cilium used an acceptable amount of resources, comparable to the other CNIs we tried (Calico, Canal, and Flannel), and stayed within our acceptable threshold of 1 core per node.
So, in practice, the slightly heavier footprint of Cilium was acceptable for our small-cluster use case, and it has been working very well on our setup.
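To make the 1-core-per-node threshold concrete, here is a small Python sketch that converts Kubernetes-style CPU quantities (such as the millicore values reported by `kubectl top`) into cores and compares them against a budget. The function names are hypothetical and the sketch contains no measurements from our cluster.

```python
from decimal import Decimal

# Suffixes used by Kubernetes CPU quantities: nano-, micro-, and millicores.
_SUFFIX_SCALE = {
    "n": Decimal("1e-9"),
    "u": Decimal("1e-6"),
    "m": Decimal("1e-3"),
}


def cpu_quantity_to_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '250m', '2', or '1500000000n' to cores."""
    quantity = quantity.strip()
    scale = _SUFFIX_SCALE.get(quantity[-1:])
    if scale is None:
        # No recognized suffix: the value is already expressed in whole cores.
        return Decimal(quantity)
    return Decimal(quantity[:-1]) * scale


def within_budget(quantity: str, budget_cores: Decimal = Decimal("1")) -> bool:
    """Return True if the CPU quantity is at or below the per-node budget."""
    return cpu_quantity_to_cores(quantity) <= budget_cores
```

For example, `within_budget("250m")` is true, while `within_budget("1100m")` is false against the default budget of 1 core.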