Cluster Autoscaler for Hetzner Cloud
This guide explains how to configure cluster-autoscaler for automatic node scaling in Hetzner Cloud with Talos Linux.
Prerequisites
- Hetzner Cloud account with API token
- hcloud CLI installed
- Existing Talos Kubernetes cluster
- Networking Mesh and Local CCM configured
Step 1: Create Talos Image in Hetzner Cloud
Hetzner doesn’t support direct image uploads, so we need to create a snapshot via a temporary server.
1.1 Generate Schematic ID
Create a schematic at factory.talos.dev with required extensions:
curl -s -X POST https://factory.talos.dev/schematics \
  -H "Content-Type: application/json" \
  -d '{
    "customization": {
      "systemExtensions": {
        "officialExtensions": [
          "siderolabs/qemu-guest-agent",
          "siderolabs/amd-ucode",
          "siderolabs/amdgpu-firmware",
          "siderolabs/bnx2-bnx2x",
          "siderolabs/drbd",
          "siderolabs/i915-ucode",
          "siderolabs/intel-ice-firmware",
          "siderolabs/intel-ucode",
          "siderolabs/qlogic-firmware",
          "siderolabs/zfs"
        ]
      }
    }
  }'
Save the returned id as SCHEMATIC_ID.
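If you script this step, the id can be captured straight from the response. A minimal sketch, assuming jq is installed and the JSON body above is saved as schematic.json (a filename chosen here for illustration):
SCHEMATIC_ID=$(curl -s -X POST https://factory.talos.dev/schematics \
  -H "Content-Type: application/json" \
  -d @schematic.json | jq -r '.id')
echo "SCHEMATIC_ID=${SCHEMATIC_ID}"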
Note
siderolabs/qemu-guest-agent is required for Hetzner Cloud. Add other extensions
(zfs, drbd, etc.) as needed for your workloads.
1.2 Configure hcloud CLI
export HCLOUD_TOKEN="<your-hetzner-api-token>"
1.3 Create temporary server in rescue mode
# Create server (without starting)
hcloud server create \
--name talos-image-builder \
--type cpx22 \
--image ubuntu-24.04 \
--location fsn1 \
--ssh-key <your-ssh-key-name> \
--start-after-create=false
# Enable rescue mode and start
hcloud server enable-rescue --type linux64 --ssh-key <your-ssh-key-name> talos-image-builder
hcloud server poweron talos-image-builder
1.4 Write Talos image to disk
# Get server IP
SERVER_IP=$(hcloud server ip talos-image-builder)
# SSH into rescue mode and write image
ssh root@$SERVER_IP
# Inside rescue mode. Set SCHEMATIC_ID again here; variables from your local shell do not carry over SSH:
wget -O- "https://factory.talos.dev/image/${SCHEMATIC_ID}/<talos-version>/hcloud-amd64.raw.xz" \
| xz -d \
| dd of=/dev/sda bs=4M status=progress
sync
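# (Optional) Sanity check before exiting: after a successful write the disk
# should show Talos partitions. fdisk ships with the Hetzner rescue system.
fdisk -l /dev/sda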
exit
1.5 Create snapshot and cleanup
# Power off and create snapshot
hcloud server poweroff talos-image-builder
hcloud server create-image --type snapshot --description "Talos <talos-version>" talos-image-builder
# Get snapshot ID (save this for later)
hcloud image list --type snapshot
# Delete temporary server
hcloud server delete talos-image-builder
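If you prefer the snapshot ID in a variable instead of copying it from the list, a sketch using the CLI's JSON output (assumes jq is installed; picking the newest snapshot by its created field is only a heuristic, adjust to your naming):
SNAPSHOT_ID=$(hcloud image list --type snapshot -o json \
  | jq -r 'max_by(.created) | .id')
echo "SNAPSHOT_ID=${SNAPSHOT_ID}"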
Step 2: Create Hetzner vSwitch (Optional but Recommended)
Create a private network for communication between nodes:
# Create network
hcloud network create --name cozystack-vswitch --ip-range 10.100.0.0/16
# Add subnet for your region (eu-central covers FSN1, NBG1)
hcloud network add-subnet cozystack-vswitch \
--type cloud \
--network-zone eu-central \
--ip-range 10.100.0.0/24
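To confirm the subnet was attached, inspect the network:
hcloud network describe cozystack-vswitch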
Step 3: Create Talos Machine Config
From your cluster repository, generate a worker config file:
talm template -t templates/worker.yaml --offline --full > nodes/hetzner.yaml
Then edit nodes/hetzner.yaml for Hetzner workers:
- Add Hetzner location metadata (see Networking Mesh):

  machine:
    nodeAnnotations:
      kilo.squat.ai/location: hetzner-cloud
      kilo.squat.ai/persistent-keepalive: "20"
    nodeLabels:
      topology.kubernetes.io/zone: hetzner-cloud

- Set the public Kubernetes API endpoint: change cluster.controlPlane.endpoint to the public API server address (for example https://<public-api-ip>:6443). You can find this address in your kubeconfig or publish it via ingress.
- Remove discovered installer/network sections: delete the machine.install and machine.network sections from this file.
- Set the external cloud provider for kubelet (see Local CCM):

  machine:
    kubelet:
      extraArgs:
        cloud-provider: external

- Fix node IP subnet detection: set machine.kubelet.nodeIP.validSubnets to your vSwitch subnet (for example 10.100.0.0/24).
- (Optional) Add registry mirrors to avoid Docker Hub rate limiting:

  machine:
    registries:
      mirrors:
        docker.io:
          endpoints:
            - https://mirror.gcr.io
Result should include at least:
machine:
  nodeAnnotations:
    kilo.squat.ai/location: hetzner-cloud
    kilo.squat.ai/persistent-keepalive: "20"
  nodeLabels:
    topology.kubernetes.io/zone: hetzner-cloud
  kubelet:
    nodeIP:
      validSubnets:
        - 10.100.0.0/24 # replace with your vSwitch subnet
    extraArgs:
      cloud-provider: external
  registries:
    mirrors:
      docker.io:
        endpoints:
          - https://mirror.gcr.io
cluster:
  controlPlane:
    endpoint: https://<public-api-ip>:6443
All other settings (cluster tokens, CA, extensions, etc.) remain the same as the generated template.
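Before wiring the file into a secret, you can lint it locally with talosctl, assuming it is installed (--mode cloud matches the cloud platform boot used on Hetzner):
talosctl validate --config nodes/hetzner.yaml --mode cloud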
Step 4: Create Kubernetes Secrets
4.1 Create secret with Hetzner API token
kubectl -n cozy-cluster-autoscaler-hetzner create secret generic hetzner-credentials \
--from-literal=token=<your-hetzner-api-token>
4.2 Create secret with Talos machine config
The machine config must be base64-encoded:
# Encode nodes/hetzner.yaml as single-line base64 (GNU coreutils;
# on macOS use: base64 -i nodes/hetzner.yaml -o hetzner.b64)
base64 -w 0 nodes/hetzner.yaml > hetzner.b64
# Create secret
kubectl -n cozy-cluster-autoscaler-hetzner create secret generic talos-config \
  --from-file=cloud-init=hetzner.b64
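To sanity-check the secret, decode it twice (once for the Secret's own base64 layer, once for the base64-encoded machine config) and confirm your YAML comes back:
kubectl -n cozy-cluster-autoscaler-hetzner get secret talos-config \
  -o jsonpath='{.data.cloud-init}' | base64 -d | base64 -d | head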
Step 5: Deploy Cluster Autoscaler
Create the Package resource and save it as package.yaml:
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: cozystack.cluster-autoscaler-hetzner
spec:
  variant: default
  components:
    cluster-autoscaler-hetzner:
      values:
        cluster-autoscaler:
          autoscalingGroups:
            - name: workers-fsn1
              minSize: 0
              maxSize: 10
              instanceType: cpx22
              region: FSN1
          extraEnv:
            HCLOUD_IMAGE: "<snapshot-id>"
            HCLOUD_SSH_KEY: "<ssh-key-name>"
            HCLOUD_NETWORK: "cozystack-vswitch"
            HCLOUD_PUBLIC_IPV4: "true"
            HCLOUD_PUBLIC_IPV6: "false"
          extraEnvSecrets:
            HCLOUD_TOKEN:
              name: hetzner-credentials
              key: token
            HCLOUD_CLOUD_INIT:
              name: talos-config
              key: cloud-init
Apply:
kubectl apply -f package.yaml
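Before testing, confirm the autoscaler deployment is up (deployment name as used in Step 7 below):
kubectl -n cozy-cluster-autoscaler-hetzner rollout status \
  deployment/cluster-autoscaler-hetzner-hetzner-cluster-autoscaler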
Step 6: Test Autoscaling
Create a deployment with pod anti-affinity to force scale-up:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-autoscaler
spec:
  replicas: 5
  selector:
    matchLabels:
      app: test-autoscaler
  template:
    metadata:
      labels:
        app: test-autoscaler
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: test-autoscaler
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
If you have fewer nodes than replicas, the autoscaler will create new Hetzner servers.
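One way to watch the scale-up live, in two terminals:
# Terminal 1: pending pods schedule as new nodes join
kubectl get pods -l app=test-autoscaler -o wide -w
# Terminal 2: new Hetzner servers register as nodes
kubectl get nodes -w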
Step 7: Verify
# Check autoscaler logs
kubectl -n cozy-cluster-autoscaler-hetzner logs \
deployment/cluster-autoscaler-hetzner-hetzner-cluster-autoscaler -f
# Check nodes
kubectl get nodes -o wide
# Verify node labels and internal IP
kubectl get node <node-name> --show-labels
Expected result for autoscaled nodes:
- Internal IP from vSwitch range (e.g., 10.100.0.2)
- Label kilo.squat.ai/location=hetzner-cloud
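A quick filter for the expected label:
kubectl get nodes -l kilo.squat.ai/location=hetzner-cloud -o wide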
Configuration Reference
Environment Variables
| Variable | Description | Required |
|---|---|---|
| HCLOUD_TOKEN | Hetzner API token | Yes |
| HCLOUD_IMAGE | Talos snapshot ID | Yes |
| HCLOUD_CLOUD_INIT | Base64-encoded machine config | Yes |
| HCLOUD_NETWORK | vSwitch network name/ID | No |
| HCLOUD_SSH_KEY | SSH key name/ID | No |
| HCLOUD_FIREWALL | Firewall name/ID | No |
| HCLOUD_PUBLIC_IPV4 | Assign public IPv4 | No (default: true) |
| HCLOUD_PUBLIC_IPV6 | Assign public IPv6 | No (default: false) |
Hetzner Server Types
| Type | vCPU | RAM | Good for |
|---|---|---|---|
| cpx22 | 2 | 4GB | Small workloads |
| cpx32 | 4 | 8GB | General purpose |
| cpx42 | 8 | 16GB | Medium workloads |
| cpx52 | 16 | 32GB | Large workloads |
| ccx13 | 2 dedicated | 8GB | CPU-intensive |
| ccx23 | 4 dedicated | 16GB | CPU-intensive |
| ccx33 | 8 dedicated | 32GB | CPU-intensive |
| cax11 | 2 ARM | 4GB | ARM workloads |
| cax21 | 4 ARM | 8GB | ARM workloads |
Note
Some older server types (cpx11, cpx21, etc.) may be unavailable in certain regions.
Hetzner Regions
| Code | Location |
|---|---|
| FSN1 | Falkenstein, Germany |
| NBG1 | Nuremberg, Germany |
| HEL1 | Helsinki, Finland |
| ASH | Ashburn, USA |
| HIL | Hillsboro, USA |
Troubleshooting
Connecting to remote workers for diagnostics
Talos does not allow opening a dashboard directly to worker nodes. Use talm dashboard
to connect through the control plane:
talm dashboard -f nodes/<control-plane>.yaml -n <worker-node-ip>
Where <control-plane>.yaml is your control plane node config and <worker-node-ip> is
the Kubernetes internal IP of the remote worker.
Nodes not joining cluster
- Check the VNC console via Hetzner Cloud Console or:

  hcloud server request-console <server-name>

- Common errors:
  - “unknown keys found during decoding”: check the Talos config format. nodeLabels goes under machine, nodeIP goes under machine.kubelet
  - “kubelet image is not valid”: Kubernetes version mismatch. Use a kubelet version compatible with your Talos version
  - “failed to load config”: machine config syntax error
Nodes have wrong Internal IP
Ensure machine.kubelet.nodeIP.validSubnets is set to your vSwitch subnet:
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.100.0.0/24
Scale-up not triggered
- Check autoscaler logs for errors
- Verify RBAC permissions (leases access required)
- Check if pods are actually pending:
kubectl get pods --field-selector=status.phase=Pending
Registry rate limiting (403 errors)
Add registry mirrors to Talos config:
machine:
  registries:
    mirrors:
      docker.io:
        endpoints:
          - https://mirror.gcr.io
      registry.k8s.io:
        endpoints:
          - https://registry.k8s.io
Scale-down not working
The autoscaler caches node information for up to 30 minutes. Wait or restart autoscaler:
kubectl -n cozy-cluster-autoscaler-hetzner rollout restart \
deployment cluster-autoscaler-hetzner-hetzner-cluster-autoscaler