Revamping the SKS Node OS Image: Performance, Tooling, and Security

Exoscale’s SKS (Scalable Kubernetes Service) has introduced a major revamp of the Kubernetes node base image as part of the Kubernetes 1.32 update. The revamp streamlines the operating system running on nodes to improve performance, adopts modern tooling for image builds and node bootstrap, and tightens both the security and the reliability of the node environment. In this post, we dig into the technical details of these changes, highlight the enhancements, explain the role of mkosi and sks-node-agent (a new, custom instance initialization program), and examine the implications of a leaner OS design.

Performance improvements in the new SKS Node Operating System image

One of the primary goals of the SKS Node Operating System image (abbreviated as “OS image” in the rest of this article) revamp was to improve efficiency by removing unneeded components and optimizing the boot sequence. We observed measurable improvements in those areas.

Regarding boot time, stripping the node down to only the required components lets the new SKS node image boot and initialize much faster. A leaner OS means fewer services to start and less disk I/O during boot, so nodes join the cluster sooner after they are provisioned. Node boot time (excluding the deployment of the instance itself) dropped from 9 seconds to 4 seconds, a 55.6% improvement. These results match our expectations: an OS with a smaller footprint typically yields shorter boot times and faster node scaling.

The new OS image also optimizes resource usage, lowering the runtime overhead on each node. We removed unnecessary services from the OS, such as the SSH server and rsyslogd. In practice, this frees up a small amount of CPU and memory for your containerized workloads: when idle, the new OS image consumes 118 MB of RAM (excluding kubelet), compared with 132 MB for the previous implementation under the same conditions (an improvement of roughly 10%).

The revamped node image is also much smaller. We eliminated about 1.4 GB of unneeded data, shrinking the image from almost 2 GB to about 600 MB (a 70% reduction). How did we achieve that? We changed our image-building process: instead of removing non-essential packages from an existing full-featured image, we now start from scratch and install only the packages we need. The previous image included slightly more than 400 packages, while the new one includes slightly fewer than 200. In theory, a Kubernetes node image could be even smaller, but we still needed to support some customer use cases and wanted to keep relying on standard Linux userland tools such as systemd. Other open-source projects have achieved comparable reductions using similar processes: for instance, Edgeless Systems reduced the node OS image of Constellation from 3 GB to 700 MB. This size reduction greatly improves efficiency and speeds up node deployments when the OS image is not yet present on the target hypervisors.

Another improvement is a simpler OS startup process, obtained by removing legacy or unsuitable components. Previous implementations of the OS image used cloud-init for instance initialization, injecting essential data pre-rendered by the SKS orchestrator (our software component in charge of spawning SKS control planes and handling SKS nodepools). We replaced cloud-init with standard systemd features and implemented the missing pieces in a custom compiled program we call sks-node-agent, which is discussed later in this article. This change also contributes to faster boot times and a smaller image: cloud-init depends on various Python libraries, whereas sks-node-agent is a self-contained compiled binary, and cloud-init also handled networking, a task the new image delegates to systemd-networkd.

In our internal tests, these improvements combined reduced the time between nodepool creation and the first node registering with the Kubernetes control plane from ~35 seconds to ~20 seconds (an improvement of more than 42%).
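As a quick sanity check, the percentages quoted in this section can be recomputed from the before/after figures. The small Go program below uses the numbers given above (9 s → 4 s boot, 132 MB → 118 MB idle RAM, ~35 s → ~20 s to the first registered node) and treats "almost 2GB" as 2,000 MB, which is an assumption; the metric labels are our own.

```go
package main

import "fmt"

// improvement returns the relative reduction from before to after, in percent.
func improvement(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Figures quoted in the article; 2000 MB for "almost 2GB" is an assumption.
	metrics := []struct {
		name          string
		before, after float64
	}{
		{"node boot time (s)", 9, 4},
		{"idle RAM excluding kubelet (MB)", 132, 118},
		{"image size (MB)", 2000, 600},
		{"nodepool creation to first node registered (s)", 35, 20},
	}
	for _, m := range metrics {
		fmt.Printf("%-48s %6.1f -> %6.1f  (-%.1f%%)\n",
			m.name, m.before, m.after, improvement(m.before, m.after))
	}
}
```

It reports reductions of about 55.6% (boot time), 10.6% (idle RAM), 70.0% (image size), and 42.9% (time to first registered node), consistent with the figures above.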

While this article focuses on how nodes work, you may also be interested in one of our previous articles: A deep dive into Exoscale SKS internals.

At the heart of the OS Image revamp: `mkosi` build and `sks-node-agent` initialization system

To achieve the improvements above, we revamped both how the OS image is built and how the OS boots at runtime. This section outlines the two key choices that made this implementation possible: using mkosi for image creation and introducing sks-node-agent for compute instance initialization. We explain how each works and why we chose them over more traditional approaches.

Leveraging `mkosi` to build OS images

In the previous design of the OS image, we based our approach on a slightly modified version of the Ubuntu cloud image, leveraging HashiCorp’s Packer and our Packer plugin. This approach helped us get started quickly by using well-known tools. Over time, however, we also saw opportunities to further streamline and modernize the build process: we wanted more precise control over the contents of the image, improved reproducibility, and deeper support for newer boot and security capabilities.

To address these goals, we decided to replace Packer with mkosi. mkosi creates OS images in a declarative way: as a developer, you provide a configuration describing the target OS properties (like the base distribution, packages, additional files, and settings), and mkosi assembles a bootable image accordingly in a controlled, ephemeral environment (using systemd-nspawn).

Under the hood, mkosi leverages the native package manager of the selected base Linux distribution. For instance, when you select a Debian-based distribution, it leverages debootstrap; if you select Fedora, it leverages dnf --installroot. This approach bootstraps a fresh filesystem with the minimum amount of required packages and those you defined in the configuration, resulting in a cleaner, more repeatable image build.

Why did we choose `mkosi`?

mkosi’s straightforward, declarative style for defining OS image content brings consistency and control. For example, if we need to rebuild the same image for Kubernetes 1.32.2 or iterate on it with minor changes, we can do so repeatably without introducing unexpected differences. mkosi also enables minimal builds that include only the packages that are truly needed: the Linux kernel, systemd, the container runtime (containerd), kubelet, a few utilities (for networking, CNI plugins, etc.), and little else.

Another notable feature of mkosi is that it simplifies the adoption of modern boot mechanisms such as Unified Kernel Images (UKI). While we do not use all of these features yet, they give us options to improve node security going forward.

Finally, mkosi’s incremental build support speeds up local development and testing, making our iterative workflows more efficient.

Here is an example of how an OS image is defined with mkosi; it is quite self-explanatory:

```ini
[Distribution]
Distribution=ubuntu
Release=noble

[Content]
Bootable=yes
Bootloader=uki
# Minimum bootable OS (mkosi only treats whole lines starting with # as comments)
Packages=
          linux-image-virtual
          systemd
          systemd-boot
          systemd-resolved
          systemd-timesyncd
          ...

[Output]
ManifestFormat=json
Format=disk
```

`sks-node-agent`: Replacing `cloud-init` for compute instance initialization

The second key component of the OS image revamp is the new sks-node-agent, a lightweight custom initialization service that runs on each node at boot. Previously, we relied on cloud-init to configure cluster-specific settings for each Kubernetes node (e.g., the kubelet’s TLS bootstrap token, Kubernetes API server endpoint, node labels, taints, etc.). This approach worked well in many scenarios but left room for more tailoring and consistency.

With the new OS image, we chose to remove cloud-init and its dependencies in favor of a purpose-built agent that handles SKS-specific features. sks-node-agent is a small Go program that starts early in the node’s lifecycle and integrates closely with the rest of the system.

When an SKS compute instance boots, networking is brought up by systemd-networkd, right after which systemd starts sks-node-agent. The agent retrieves the configuration needed to join the cluster from the compute instance metadata service, which the SKS orchestrator populates with cluster information and node parameters in a declarative TOML file. sks-node-agent parses this content and applies the required actions: generating the kubelet configuration files, the bootstrap kubeconfig, and the relevant systemd overrides. Finally, it launches kubelet and exits.

Once kubelet has been started, sks-node-agent does not run again unless the instance is restarted.

Here is a high-level overview of the whole startup process:

Figure: SKS node startup process

By shifting to a declarative approach for node initialization, sks-node-agent avoids shell-script rendering intricacies and offers a more predictable, testable path for cluster bootstrap logic. At the same time, we gain the flexibility to extend this agent with new features as SKS evolves.
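The "predictable, testable" point is easiest to see with a pure function. The Go sketch below is illustrative and not the actual sks-node-agent code: it renders a hypothetical systemd drop-in for kubelet from declarative inputs. The `--node-labels` and `--register-with-taints` kubelet flags are real, but the `KUBELET_EXTRA_ARGS` variable name and the drop-in path mentioned in the comments are assumptions. Sorting the label keys makes the output byte-for-byte reproducible, so it can be checked in a plain unit test instead of by inspecting rendered shell scripts.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Taint mirrors the key[=value]:Effect syntax accepted by kubelet's
// --register-with-taints flag.
type Taint struct {
	Key, Value, Effect string
}

// renderKubeletDropIn builds the content of a hypothetical systemd drop-in
// (e.g. /etc/systemd/system/kubelet.service.d/10-sks.conf) that passes node
// labels and taints to kubelet. Label keys are sorted so identical input
// always yields identical output.
func renderKubeletDropIn(labels map[string]string, taints []Taint) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	labelPairs := make([]string, 0, len(keys))
	for _, k := range keys {
		labelPairs = append(labelPairs, k+"="+labels[k])
	}

	taintSpecs := make([]string, 0, len(taints))
	for _, t := range taints {
		if t.Value == "" {
			taintSpecs = append(taintSpecs, t.Key+":"+t.Effect)
		} else {
			taintSpecs = append(taintSpecs, t.Key+"="+t.Value+":"+t.Effect)
		}
	}

	var args []string
	if len(labelPairs) > 0 {
		args = append(args, "--node-labels="+strings.Join(labelPairs, ","))
	}
	if len(taintSpecs) > 0 {
		args = append(args, "--register-with-taints="+strings.Join(taintSpecs, ","))
	}
	// KUBELET_EXTRA_ARGS is an assumed variable that the kubelet unit would expand.
	return fmt.Sprintf("[Service]\nEnvironment=\"KUBELET_EXTRA_ARGS=%s\"\n", strings.Join(args, " "))
}

func main() {
	fmt.Print(renderKubeletDropIn(
		map[string]string{"pool": "gpu", "env": "prod"},
		[]Taint{{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}},
	))
}
```

With the inputs in `main`, the program prints a `[Service]` section whose `Environment=` line carries `--node-labels=env=prod,pool=gpu --register-with-taints=dedicated=gpu:NoSchedule`, regardless of map iteration order.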

Why do these changes matter?

The gains outlined in the first section may seem small, but collectively they reflect a higher overall quality of the OS image.

From an engineering perspective, adopting mkosi and introducing sks-node-agent is about control and optimization. If we need a new package or a configuration tweak for Kubernetes, we simply update the mkosi config and rebuild, rather than waiting for distro-level updates or deeply customizing a generic cloud image. The node image is now a purpose-built Kubernetes appliance rather than a specialized flavor of a broader distribution, aligning with approaches taken by other Kubernetes providers (for example, GKE’s Container-Optimized OS, EKS’s Bottlerocket, etc.).

Similarly, having our own agent (sks-node-agent) for node bring-up lets us tailor the process closely to the SKS control plane, opening the door to advanced lifecycle management features. In the future, it could handle additional node checks or setup tasks specific to SKS: tasks that can be cumbersome in generic workflows but become straightforward with a custom agent.

Implications of a smaller, almost immutable OS image

We already talked about performance and lifecycle management improvements in this OS revamp. In addition to those aspects, this revamp improves the security posture of SKS clusters. Indeed, by having a smaller OS image with fewer components, each Kubernetes node has a reduced attack surface. Every omitted package or service removes a potential path for vulnerabilities.

For instance, removing the SSH server eliminates the risk of SSH-based attacks or unauthorized access, making it much more difficult for an attacker to directly log into a node. Dropping most unnecessary software from the OS image turns it into more of an appliance: changes to it are delivered as full image updates. At the same time, we have fewer security issues to track and patch, which aligns with the broader industry trend toward container-focused, minimal operating systems.

Because we no longer ship an SSH service on the nodes, any troubleshooting or ad-hoc operation should go through standard Kubernetes APIs (e.g., kubectl debug for ephemeral troubleshooting pods). We encourage treating nodes as cattle (managed with consistent, automated processes) rather than as pets that require manual, one-off modifications.

The security improvement of our OS image is thus a natural outcome of the optimizations described above.

Conclusion

The SKS Kubernetes 1.32 node OS image revamp is a significant technical improvement that brings faster OS readiness, better resource usage, and a hardened profile to Exoscale’s SKS Kubernetes service.

By leveraging mkosi to craft a purpose-built minimal OS and replacing conventional init mechanisms with the dedicated sks-node-agent, we’ve achieved a level of efficiency and control that directly benefits Kubernetes users on SKS. While we haven’t detailed our entire image-building workflow here, this new pipeline now allows us to adopt upstream Kubernetes releases more quickly. In cases where a new version introduces no breaking changes, we anticipate publishing it within just a few days after its official release.

We are confident you will notice and appreciate these changes that streamline the node OS (both in bytes and processes) and provide a cleaner foundation for standard Kubernetes operations. Faster node availability means your clusters can scale more swiftly and reliably. The tight integration of our node agent with SKS automation also simplifies node lifecycle management and paves the way for future enhancements. In parallel, with fewer packages, your nodes carry a smaller attack surface, offering a more secure baseline.

Looking ahead, we plan to continue refining our approach. Future Kubernetes version updates will benefit from this build pipeline, and we may introduce more capabilities in the node agent (for example, smarter update coordination or smaller, specialized images for certain workloads). For now, if you’re an SKS user upgrading to 1.32, you’ll immediately notice these benefits under the hood.

NOTE: During the drafting of this article and while we shipped the initial implementation of Kubernetes 1.32 in SKS, a number of Kubernetes patch versions were released. We took the opportunity to leverage this new OS image design for Kubernetes 1.29.14, 1.30.10, 1.31.6, in addition to 1.32.2. We encourage you to upgrade your cluster by following our documentation.
