Full Sail: Migrating our Infrastructure to Kubernetes

A story of migration, corner cutting, and elaborate hacks in an attempt to achieve high availability and observability


It's been a looong while since we wrote a new post on this blog, and for a very good reason. We've been really busy migrating our infrastructure to Kubernetes.

Some of you may have seen the announcements on our Discord regarding the recent instability of our hosted services; that instability was due to this process.

Deploying and maintaining software on servers is a pain. Seriously. We have a large stack of both internal and external services that we use to aid in development and operate internally. The issue is: our infra is too decentralized and too centralized simultaneously.

Since the acquisition of Ultramarine Linux, we have had a fleet of LXC containers running different services, but everything ran on one single Proxmox VE node in Los Angeles. The issue is that we had to deploy every single service manually (and that node was a single point of failure). We deploy an LXC container for each service, install Tailscale on it, and then use Tailscale as a mesh that connects them, all by hand.

This convoluted process got us thinking: is there a way to combine our compute power and deploy services across all of it at once, while also ensuring high availability? The answer is Kubernetes.

Kubernetes allows us to run OCI/Docker containers inside a reschedulable, load-balanced cluster that ensures any application or service we deploy will be moved elsewhere automatically if the server running it stops working.
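As a rough illustration (a minimal sketch with made-up names, not one of our actual manifests), a Deployment like this tells the cluster to keep two replicas of a service running wherever there's capacity; if the node underneath dies, the scheduler just recreates the pods on a healthy one:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service          # hypothetical name, for illustration only
    spec:
      replicas: 2                    # keep two copies running at all times
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
        spec:
          containers:
            - name: web
              image: ghcr.io/example/service:latest  # placeholder image
              ports:
                - containerPort: 8080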

Picking a Solution

The first thing on the list for this migration was to pick a cloud provider we could rely on to scale these workloads effectively while staying affordable. While we could have used a hosted Kubernetes service like Amazon's EKS, Google's GKE, or Microsoft's AKS, we found that with our current workload it would've cost thousands of dollars monthly. So we decided to host our own Kubernetes cluster, deployed on cloud VMs, to ensure that we can scale whenever we need to and can afford to, not whenever the application wants to.

We tried a couple of low-cost cloud providers, but we found that low-end cloud platforms tend to be unstable, lacking in features, or just really slow. So we chose Hetzner Cloud for its balance of features, reliability, and cost.

Our current setup is a 3-node HA cluster running on Hetzner, with 4 vCPUs and 8GB of RAM each, deployed in a spread placement across multiple datacenters to ensure high availability: when one node fails, our stack runs in a degraded state (some services unavailable) rather than failing completely. Our costs total out to around $50 USD a month.

Kubernetes Itself

Kubernetes comes in various distributions made for specific platforms or purposes. Hosted Kubernetes providers have custom engines, and then there are the distributions made to be run on-premises, which is what we'll be using.

While a standard Kubernetes installation with kubeadm is great, it's fairly heavyweight and comes with many components made to integrate with a whole range of cloud providers (when we only use one of them), making it overkill for a small cluster.

K3s is a lightweight Kubernetes distribution by Rancher that aims to strip Kubernetes down to its basics. It comes as a single binary with various embedded components: an embedded copy of etcd for storing cluster state, the Flannel CNI for basic container networking, the CoreDNS service for cluster DNS, and the standard API server for controlling the cluster. It also optionally installs the Klipper (ServiceLB) load balancer and Traefik for reverse-proxy ingress.

While K3s may look like an appealing option when you're running Kubernetes on a budget, the issue is that its core components are embedded in and run inside a single executable. So when a clustered component like etcd fails due to an unstable network connection, components like Flannel are shut down with it, causing an instant network failure on that node. If that node also happens to schedule workloads (and on a small budget you can't really afford dedicated control-plane nodes anyway), everything that relies on the services running there starts failing too, causing a cascading failure.

So we picked a different solution: RKE2, another distribution by Rancher based on K3s, focused on security hardening and standards compliance. While mostly designed for strict enterprise environments like government or military deployments, it is also a great (if slightly heavier) alternative to K3s.

Storage: JuiceFS & Hetzner Cloud

By default, all Kubernetes workloads are stateless. That means everything is ephemeral and will be wiped like it never existed once the process stops running. But not everything is supposed to be that way: without persistent storage you can never save your data, and all that work just produces hot air in the end.

Fortunately, K8s supports various CSI volume drivers that can be used to create, mount, and manage stateful volumes when needed. One of these solutions is Longhorn, a cloud-native replicated block storage solution made by Rancher. While Longhorn was great for smaller workloads, under unstable and/or edge-case conditions volumes can often become corrupted and fail, putting them into read-only mode or sometimes rendering them completely unusable.

We ended up using both: a combination of Hetzner Cloud volumes for databases, and JuiceFS volumes for other applications that don't need fast I/O.
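In practice that means two storage classes on the cluster. As a sketch (the class names here follow the Hetzner CSI and JuiceFS CSI driver defaults; ours may differ slightly), a database claims a fast block volume while everything else claims a shared JuiceFS volume:

    # Fast block storage for databases, backed by a Hetzner Cloud volume
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-data            # illustrative name
    spec:
      accessModes: ["ReadWriteOnce"] # block volumes attach to a single node
      storageClassName: hcloud-volumes
      resources:
        requests:
          storage: 10Gi
    ---
    # Shared, slower storage for everything else, backed by JuiceFS
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: app-uploads              # illustrative name
    spec:
      accessModes: ["ReadWriteMany"] # JuiceFS volumes can be shared across nodes
      storageClassName: juicefs-sc
      resources:
        requests:
          storage: 50Gi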

Configuration: Flux CD

What's the point of setting this all up if you cannot deploy applications in an observable and centralized manner, right? Fortunately, we know a little service that helps you do that very easily: Flux CD.

While yes, we use Rancher for our control panel, we decided against using Rancher Fleet to deploy applications in our clusters, due to one small issue: observability.

While Fleet is a great tool that comes with Rancher Manager and lets you do GitOps quickly, it misses one thing from Flux that we really need: notifications. With Fleet, we cannot observe the state of our deployments after provisioning them and have it notify us when something inevitably goes wrong.
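For reference, this is roughly what that looks like in Flux (a sketch with placeholder names and scope, not our exact setup): a Provider pointing at a Discord webhook, and an Alert that forwards reconciliation failures to it.

    apiVersion: notification.toolkit.fluxcd.io/v1beta2
    kind: Provider
    metadata:
      name: discord
      namespace: flux-system
    spec:
      type: discord
      secretRef:
        name: discord-webhook        # Secret holding the webhook address
    ---
    apiVersion: notification.toolkit.fluxcd.io/v1beta2
    kind: Alert
    metadata:
      name: cluster-alerts
      namespace: flux-system
    spec:
      providerRef:
        name: discord
      eventSeverity: error           # only ping us when something breaks
      eventSources:
        - kind: Kustomization
          name: '*'
        - kind: HelmRelease
          name: '*'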

Weaveworks has given us a great tool for managing cluster deployments and keeping our workload configurations in a single monorepo.

In production, we use Flux CD manifests with encrypted secrets provided by a combination of SOPS, HashiCorp Vault, and authentik for authentication and access management. Only sometimes do we directly edit cluster resources for debugging.
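As a simplified sketch of how that fits together (paths and names are placeholders), each Flux Kustomization points at a path in the monorepo and tells Flux how to decrypt the SOPS-encrypted secrets inside it:

    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: apps
      namespace: flux-system
    spec:
      interval: 10m
      path: ./clusters/production/apps   # illustrative path in the monorepo
      prune: true
      sourceRef:
        kind: GitRepository
        name: flux-system
      decryption:
        provider: sops                   # decrypt SOPS-encrypted manifests on the fly
        secretRef:
          name: sops-keys                # key material for SOPS (e.g. an age or GPG key)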

The Story: Setting it up

I got approval to start provisioning nodes on Hetzner Cloud. We deployed 3 CPX31 servers, and wrote an Ansible playbook to automatically provision them and install Rancher RKE2.

We used the RKE2 Ansible Role to provision our Kubernetes cluster, along with a collection of custom performance scripts and manifests to further tailor the deployment to our needs. We did, however, decide not to use the Hetzner Cloud CCM to manage our cluster, instead opting for RKE2's built-in CCM module, just in case we decide to go multi-cloud and scale the cluster out to multiple regions.
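What the role and our extra manifests ultimately boil down to on each node is an /etc/rancher/rke2/config.yaml. A trimmed-down sketch of what a server node's config might look like (the values are illustrative, not our actual settings):

    # /etc/rancher/rke2/config.yaml (illustrative server-node sketch)
    token: "REDACTED-SHARED-SECRET"   # placeholder cluster join secret
    tls-san:
      - kube.example.org              # illustrative API endpoint for the HA control plane
    cni: calico                       # more on this choice below
    disable:
      - rke2-ingress-nginx            # we eventually swapped the bundled ingress for Traefik (see below)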

We decided to use Calico for our CNI, though we had a rough start with it at first. Configuring Calico on an RKE2 cluster means you have to manually edit manifest files from within the nodes themselves, which is a little annoying to work with, but it's fine enough.
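Concretely, customizing Calico on RKE2 means dropping a HelmChartConfig into the server's manifest directory on the node itself, something roughly like this (the override shown is just an example, not our exact tuning):

    # /var/lib/rancher/rke2/server/manifests/rke2-calico-config.yaml
    apiVersion: helm.cattle.io/v1
    kind: HelmChartConfig
    metadata:
      name: rke2-calico
      namespace: kube-system
    spec:
      valuesContent: |-
        installation:
          calicoNetwork:
            mtu: 1450   # example: match the MTU of the cloud's private network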

For ingress, we initially went with the built-in NGINX Ingress Controller, which is a hardened fork of the upstream Kubernetes NGINX Ingress Controller. Eventually we hit some pretty bad performance issues with it, with it taking a long time to refresh and regenerate reverse-proxy configurations for ingress services, so we went with Traefik instead.

At first, setting up was mostly a breeze: we had to copy some configurations and migrate our services from dedicated Docker Compose stacks into Kubernetes deployments. Then the issues started piling up, especially around resource usage.

The stack

Currently on our cluster, we deploy various services and components to meet our new requirements under a distributed workload.

  • Weblate (https://weblate.fyralabs.com)
  • Rancher Server, for cluster management and control plane
  • Weave GitOps, to assist us further in our Flux CD deployments
  • MetalLB, as an external load balancer solution
  • The Calico CNI
  • CrunchyData PGO operator for PostgreSQL
  • Synapse server for Matrix chat
  • Outline, as our internal knowledge base
  • Authentik Identity server
  • HashiCorp Consul service mesh
  • HashiCorp Vault for secrets management
  • Plausible analytics
  • Traefik Ingress controller
  • Mastodon (https://mastodon.fyralabs.com)
  • Raboneko, an internal progress tracking and utility Discord bot, which we use for our weekly updates
  • Lanyard, a service for querying Discord presence, used on our about page
  • Subatomic, our in-house package management server, used in Terra
  • Madoguchi, another in-house service to track builds for Terra
  • Ghost, the blog you're reading this on :)
  • KeyDB as an alternative to Redis
  • Bitpoke MySQL Operator
  • Cert-manager, for automatically provisioning certificates from Let's Encrypt, self-signed CAs, or Vault
  • JuiceFS, a POSIX-compatible distributed filesystem that only needs a database and an object store, or simply just the database.
  • And many other services, which are still being migrated or which we forgot to mention :p

Hetzner Cloud integrations

  • Hetzner Cloud CSI Controller, for provisioning volumes to be used under K8s.
  • Hetzner Cloud floating IP controller, used in conjunction with MetalLB to provide a cheap alternative to Hetzner's own load balancer service.
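The rough idea: the floating IP controller keeps the Hetzner floating IP pointed at a healthy node, while MetalLB hands that same address out to LoadBalancer services. A sketch of what the MetalLB side could look like (the address is a documentation placeholder, not our real floating IP):

    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: floating-ips
      namespace: metallb-system
    spec:
      addresses:
        - 203.0.113.10/32        # placeholder for the Hetzner floating IP
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: floating-ips
      namespace: metallb-system
    spec:
      ipAddressPools:
        - floating-ips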

Overloaded: How we hit a limit with our workload, and our budget

We decided that this was going to be a low-cost, self-maintained stack where we were willing to hack around and find out what works in a cluster. Per my usual SOP, we started setting up our monitoring stack.

Let's talk about our monitoring stack. You can't really just go in blind with a bunch of computers, deploy programs on them and expect all of them to run all the same. That's why we decided to set up our own monitoring stack, with all the software we love. Despite how wonderful that sounds, it was a dreadful experience, and we almost abandoned this project over it.

Watching our cluster go by while it sets everything on fire

So, we use the Grafana LGTM stack for monitoring our systems. For those who don't know what that is, it's essentially a suite of software made by Grafana Labs designed to help administrators (like me) observe and monitor their systems. It comprises the following services:

  • Loki, a log aggregation system made to store and query log files, then analyze them so you can make sense of what your systems are doing.
  • Grafana, an analytics dashboard that allows users to create complex, detailed dashboards for analyzing and querying time-series data from various sources.
  • Tempo, a distributed tracing system compatible with Jaeger, Zipkin, and OpenTelemetry, designed to be the central store for all your application traces.
  • Mimir, a horizontally scalable and multi-tenant time-series database that is designed to be used as a long-term store for Prometheus telemetry. It is a fork of Cortex Metrics focused on performance and storage.

Additionally, there are services used to ingest data into the aforementioned stack:

  • The Grafana Agent, a system daemon that collects system information all in one application and ships it to Loki, Tempo, or Mimir.
  • Grafana Phlare/Pyroscope: a very new addition to the Grafana monitoring suite. Both are continuous profilers made to trace and profile your applications' performance and give developers insights on how to improve them. Grafana originally built Phlare as its own product, but this March they acquired Pyroscope and merged it with Phlare to form Grafana Pyroscope.

After setting everything up, and finally being able to view the metrics it collected, we found something peculiar:

The monitoring stack itself uses a lot of memory, almost half of the cluster's memory.

A graph showing the memory usage for various namespaces, with the `monitoring` namespace using 6.04 Gibibytes of memory
How?

Optimizing the monitoring stack

We tried many alternatives, using Netdata for monitoring, using vanilla Prometheus, but it all came down to the same issue: aggregating all this data is just computationally expensive. So we decided to reconfigure the stack itself.

Mimir is the biggest offender of the four applications, ingesting hundreds of thousands of system metrics and keeping all of them in memory, but we found a way to optimize its memory footprint.

Disabling replication, lowering TSDB striping sizes, and frequently compacting blocks was the way to go.

This is our incredibly cursed (but working) configuration for Mimir. It cuts memory usage down in exchange for slightly slower query times:

    mimir:
      structuredConfig:
        querier:
          iterators: true
          batch_iterators: true
          # ...
        compactor:
          compaction_interval: 1m
          block_ranges:
            - 2h0m0s
            - 12h0m0s
            - 24h0m0s
          compaction_concurrency: 20
          cleanup_interval: 10m
          deletion_delay: 24h
        ingester:
          metadata_retain_period: 1m
          instance_limits:
            max_series: 200000
        multitenancy_enabled: false # TODO: Re-enable this as we scale
        limits:
          request_rate: 10
          max_global_series_per_user: 0
          accept_ha_samples: true
          ha_cluster_label: cluster
          ha_replica_label: __replica__
          ingestion_rate: 50000
          max_fetched_chunks_per_query: 0
          compactor_blocks_retention_period: 30m
          out_of_order_time_window: 10m
          compactor_split_and_merge_shards: 16
          ruler_max_rules_per_rule_group: 0
          ruler_max_rule_groups_per_tenant: 0
        common:
          storage:
            backend: filesystem
        blocks_storage:
          backend: filesystem
          tsdb:
            retention_period: 24h
            ship_interval: 30m # Originally at 1m, we will explain why later
            head_compaction_concurrency: 60 # to compensate for the frequent head compacts
            head_chunks_write_queue_size: 0
            stripe_size: 512 # The real memory hog, slightly reduces performance in favor of memory footprint
            head_chunks_write_buffer_size_bytes: 512000 # smaller value
            ship_concurrency: 1
            head_compaction_interval: 10s
            wal_compression_enabled: true
            flush_blocks_on_shutdown: true # We only enable this for data persistence
          bucket_store:
            max_chunk_pool_bytes: 536870912
            sync_interval: 6h
            chunks_cache:
            # note, `ship_interval`
              fine_grained_chunks_caching_enabled: false
        alertmanager_storage:
          backend: filesystem
        ruler_storage:
          backend: filesystem

At this point we noticed another issue: the Grafana Agent. It's a bundle of various metrics servers written in Go, designed to be a monolithic but lightweight telemetry agent that can scrape logs and metrics, accept traces from Jaeger and OpenTelemetry, and more, all in a single daemon. However, it is designed not to store data permanently on disk, and instead ships everything it receives to a central TSDB.

Usually this is used for shipping telemetry data to Grafana Cloud, but it can also ship generic telemetry to any of its supported push protocols:

  • Prometheus (Thanos, Cortex, Mimir, VictoriaMetrics, InfluxDB)
  • Jaeger
  • OpenTelemetry
  • Loki

Grafana Agent also comes with integrated Prometheus exporters for scraping many services without installing additional exporters, including node_exporter for host metrics, Redis monitoring, PostgreSQL monitoring, Windows node monitoring, and many more. You can look at all the included integrations in the Grafana Agent documentation.
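To give an idea of the shape of it, here's a heavily trimmed sketch of a static-mode agent config that scrapes pods plus the node_exporter integration and ships everything off to Mimir and Loki (the endpoints are placeholders for our in-cluster services):

    server:
      log_level: info

    metrics:
      wal_directory: /tmp/agent-wal
      global:
        scrape_interval: 60s
        remote_write:
          - url: http://mimir.monitoring.svc:8080/api/v1/push   # placeholder Mimir endpoint
      configs:
        - name: default
          scrape_configs:
            - job_name: kubernetes-pods
              kubernetes_sd_configs:
                - role: pod

    integrations:
      node_exporter:
        enabled: true          # host metrics without a separate exporter pod

    logs:
      configs:
        - name: default
          clients:
            - url: http://loki.monitoring.svc:3100/loki/api/v1/push   # placeholder Loki endpoint
          positions:
            filename: /tmp/positions.yaml
          scrape_configs:
            - job_name: pod-logs
              kubernetes_sd_configs:
                - role: pod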

I may be getting sidetracked here, so here's the actual issue: Grafana Agent's Prometheus scraper simply takes too much memory.

Another issue is that we use the Grafana Agent Operator to scrape data, not a standalone agent instance. The operator manages Grafana Agent instances using Kubernetes Custom Resource Definitions (CRDs), essentially schemas for custom resource objects consumed by other services, like operators. The Agent Operator works similarly to the Prometheus Operator, but does not support in-depth configuration at the time of this writing.

So, what's the solution?

As I mentioned earlier, the Grafana Agent is written in Go, right? That let us (ab)use Go's garbage collector with an aggressive configuration.

GOGC=10 # Run a GC cycle whenever the heap grows 10% over what survived the last collection (the default is 100)
# Set a soft limit so the GC tries to keep memory usage below this (nearly impossible) threshold
GOMEMLIMIT=250MiB
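Since these are just environment variables on the agent's container, wiring them in looks something like this (a fragment of a container spec, not our full manifest):

    containers:
      - name: grafana-agent
        image: grafana/agent:latest      # placeholder tag
        env:
          - name: GOGC
            value: "10"                  # collect far more aggressively than the default of 100
          - name: GOMEMLIMIT
            value: 250MiB                # soft ceiling the Go runtime tries to stay under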

While this comes with a significant increase in CPU time, as Go tries to find and evict anything from memory the instant it stops being used, it results in a whopping 70% reduction in memory usage. Most of our workloads are memory-heavy rather than CPU-heavy, so trading spare CPU for memory made the most of what we have.

The cost overrun begins

A couple of days went by with our optimized cluster. Around 3 days after I set up the LGTM stack, we were in the middle of migrating our artifact manager to Kubernetes to experiment with hosting RPM and OSTree repositories on Cloudflare's R2 storage, when we decided to take a look at some buckets. To our collective shock, our aggressive Mimir configuration had been making constant calls to the R2 bucket. Around 24 million GET requests, to be exact. In 3 days.

This led to a mass panic.

What happened?

Our old Mimir configuration contained these lines:

blocks_storage:
  backend: s3
  ...
  # note these block ranges, it will be a massive footgun later
  block_ranges_period:
    - 1m0s
    - 5m0s
    - 1h0m0s
    - 6h0m0s
    - 12h0m0s
    - 24h0m0s
  ship_interval: 1m # footgun #2
  retention_period: 30m # BIG FOOTGUN
  ...

As you can probably figure out by now, in an attempt to quickly flush blocks from memory to cold storage, we had set very short block range intervals, then deleted all local data within 30 minutes of it being uploaded to R2, while shipping new blocks every minute.

While the per-minute uploads did not actually affect our costs very much (thank you, Cloudflare), the major pain point was the 30-minute retention period.

We were deleting data from local storage every 30 minutes; the issue was, we were keeping track of that data for far longer than 30 minutes.

We were storing the data and periodically checking it for abnormalities, then sending ourselves alerts via Alertmanager. The fun part: those checks look back further than 30 minutes, plus there was the constant lookback querying in Grafana from us figuring out how to further optimize our stack, going as far back as 24 hours or a few days to compare results. Since the local copies were already gone, every one of those queries had to pull blocks back down from the object store, and those downloads accumulated to thousands, then millions, then eventually 20 million requests.

Fortunately, the cost came to around $11 total, but it could have quickly escalated if we had left it any longer. Within 20 minutes, we uninstalled Loki and Mimir, then reinstalled them using local disks for long-term storage instead.

Conclusion of the LGTM Saga

Don't be fooled, we love Grafana. We really do love them, and actively support them.

If it weren't for our love of Grafana, we would've not persevered and continued attempting to hack their solutions to our use cases.

While we may not technically agree with their choice of Go (and sometimes, Python) for their monitoring software, we love what they've done with Grafana, and the LGTM stack allows for a simple yet powerful way to set up telemetry on any machine, with any software, for however long you want.

Grafana itself provides a powerful interface for building elaborate, detailed dashboards that treat every little bit of data with the respect it deserves. Loki provides a lightweight but very powerful log aggregation system that can quickly process and parse log files, be it logfmt, JSON, or any other logging format you need, using either regular expressions or easy-to-read patterns for all the obscure formats out there. Tempo provides a one-stop tracing solution that brings the likes of Jaeger, Zipkin, and OpenTelemetry together in one neat little package under a single database. And Mimir takes the industry-standard Prometheus database and cranks it up to 11, making all the components distributed and infinitely scalable as long as you have an object store.

Finally, the Grafana Agent lets you collect all of this data with one single package.

Lessons learned

  • Do not ever underestimate your storage.
  • Sometimes, a state-of-the-art monitoring solution might be simply too big for small infra.
  • Always keep track of how much you're spending, or you'll awaken the COO.
  • It's okay to move fast and break things sometimes, especially when it results in a net benefit. Just be transparent and always inform users first.