Cluster Scaling: Adding and Removing Nodes
How to add a node to a Cozystack cluster
Adding a node is done in much the same way as a regular Cozystack installation.
Install Talos on the node, using Cozystack's custom-built Talos image.
Generate the configuration for the new node, following the Talm or talosctl guide.
For example, to configure a control plane node:
talm template -e 192.168.123.20 -n 192.168.123.20 -t templates/controlplane.yaml -i > nodes/nodeN.yaml
and for a worker node:
talm template -e 192.168.123.20 -n 192.168.123.20 -t templates/worker.yaml -i > nodes/nodeN.yaml
Apply the generated configuration to the node, following the Talm or talosctl guide. For example:
talm apply -f nodes/nodeN.yaml -i
Wait for the node to reboot and join the cluster. You don't need to bootstrap it manually or install Cozystack on it; this is all done automatically.
You can check the result by running:
kubectl get nodes
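If you are adding the node as part of a script, you can also block until it reports Ready instead of polling by hand. A minimal sketch, assuming the new node registers under the name nodeN:
# Watch nodes join interactively
kubectl get nodes --watch
# Or, in a script, block until the new node is Ready (replace nodeN with the actual node name)
kubectl wait --for=condition=Ready node/nodeN --timeout=10m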
How to remove a node from a Cozystack cluster
When a cluster node fails, Cozystack maintains high availability automatically by recreating replicated PVCs and workloads on other nodes. However, some issues can only be resolved by removing the failed node:
Local storage PVs may remain bound to the failed node, which can prevent new pods that reference them from being scheduled. These need to be cleaned up manually.
The failed node remains registered in the cluster, which can lead to inconsistencies in cluster state and affect pod scheduling.
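Before starting the cleanup, it can help to confirm which node has failed and what is still scheduled on it. A quick check, assuming the failed node is named mynode:
# The failed node typically shows NotReady
kubectl get nodes -o wide
# List pods that are still assigned to the failed node
kubectl get pods -A --field-selector spec.nodeName=mynode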
Step 1: Remove the Node from the Cluster
Run the following command to remove the failed node (replace mynode with the actual node name):
kubectl delete node mynode
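If you are decommissioning a healthy node deliberately rather than reacting to a failure, you may prefer to cordon and drain it first so that workloads are evicted gracefully before deletion. A sketch, again assuming the node is named mynode:
kubectl cordon mynode
kubectl drain mynode --ignore-daemonsets --delete-emptydir-data
kubectl delete node mynode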
If the failed node is a control-plane node, you must also remove its etcd member from the etcd cluster:
talm -f nodes/node1.yaml etcd member list
Example output:
NODE          ID                 HOSTNAME   PEER URLS                     CLIENT URLS                   LEARNER
37.27.60.28   2ba6e48b8cf1a0c1   node1      https://192.168.100.11:2380   https://192.168.100.11:2379   false
37.27.60.28   b82e2194fb76ee42   node2      https://192.168.100.12:2380   https://192.168.100.12:2379   false
37.27.60.28   f24f4de3d01e5e88   node3      https://192.168.100.13:2380   https://192.168.100.13:2379   false
Then remove the corresponding member (replace the ID with the one for your failed node):
talm -f nodes/node1.yaml etcd remove-member f24f4de3d01e5e88
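To confirm the member was removed, list the etcd members again; the failed node should no longer appear in the output:
talm -f nodes/node1.yaml etcd member list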
Step 2: Remove PVCs and Pods Bound to the Failed Node
Here are a few commands to help you clean up the failed node:
Delete PVCs bound to the failed node:
(Replace mynode with the name of your failed node.)
kubectl get pv -o json | jq -r '.items[] | select(.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == "mynode").spec.claimRef | "kubectl delete pvc -n \(.namespace) \(.name)"' | sh -x
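If you want to review what will be deleted before executing anything, drop the final | sh -x so the pipeline only prints the generated kubectl delete pvc commands:
kubectl get pv -o json | jq -r '.items[] | select(.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == "mynode").spec.claimRef | "kubectl delete pvc -n \(.namespace) \(.name)"'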
Delete pods stuck in Pending state across all namespaces:
kubectl get pod -A | awk '/Pending/ {print "kubectl delete pod -n " $1 " " $2}' | sh -x
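Pods that were running on the failed node can also get stuck in Terminating state, which the command above does not catch. If you encounter them, they can be force-deleted; this is a sketch with hypothetical namespace and pod names, substitute your own:
kubectl delete pod -n mynamespace mypod --force --grace-period=0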
Step 3: Check Resource Status
After cleanup, check for any resource issues using linstor advise:
# linstor advise resource
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource ┊ Issue ┊ Possible fix ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-02b0c0a1-e0b6-4e98-9384-60ff24f3b3b6 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-02b0c0a1-e0b6-4e98-9384-60ff24f3b3b6 ┊
┊ pvc-06e3b406-23f0-4f10-8b03-84063c1b2a12 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-06e3b406-23f0-4f10-8b03-84063c1b2a12 ┊
┊ pvc-a0b8aeaf-076e-4bd9-93ed-c4db09c04d0b ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-a0b8aeaf-076e-4bd9-93ed-c4db09c04d0b ┊
┊ pvc-a523ebeb-c3b6-468d-abe5-f6afbbf31081 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-a523ebeb-c3b6-468d-abe5-f6afbbf31081 ┊
┊ pvc-cf7e87b5-3e6d-4034-903d-4625830fb5b4 ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-cf7e87b5-3e6d-4034-903d-4625830fb5b4 ┊
┊ pvc-d344bc83-97fd-4489-bbe7-5399eea57165 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-d344bc83-97fd-4489-bbe7-5399eea57165 ┊
┊ pvc-d39345a9-5446-4c64-a5ba-957ff7c7a31f ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-d39345a9-5446-4c64-a5ba-957ff7c7a31f ┊
┊ pvc-db6d4236-93bd-4268-9dcc-0ed275b17067 ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-db6d4236-93bd-4268-9dcc-0ed275b17067 ┊
┊ pvc-ebb412c3-083c-4eee-93dc-70917ea6d87e ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-ebb412c3-083c-4eee-93dc-70917ea6d87e ┊
┊ pvc-f107aacb-78d7-4ac6-97f8-8ed529a9c292 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-f107aacb-78d7-4ac6-97f8-8ed529a9c292 ┊
┊ pvc-f347d71a-b646-45e5-a717-f0a745061beb ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-f347d71a-b646-45e5-a717-f0a745061beb ┊
┊ pvc-f6e96c83-6144-4510-b0ab-61936db52391 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-f6e96c83-6144-4510-b0ab-61936db52391 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Run the linstor rd ap commands suggested in the "Possible fix" column to restore the desired replica count.
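For example, applying one of the suggested fixes from the sample output above and then re-running the check to confirm the issue is gone:
linstor rd ap --place-count 3 pvc-02b0c0a1-e0b6-4e98-9384-60ff24f3b3b6
linstor advise resource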