Troubleshooting cheatsheet
Cozystack
Getting basic information
You can see the logs of the main installer by executing:
kubectl logs -n cozy-system deploy/cozystack -f
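If the installer deployment has no running pods, start by checking the pod status in the same namespace:
kubectl get pod -n cozy-system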
All the platform components are installed using Flux CD HelmReleases.
You can list all installed HelmReleases:
# kubectl get hr -A
NAMESPACE                        NAME                        AGE    READY   STATUS
cozy-cert-manager                cert-manager                4m1s   True    Release reconciliation succeeded
cozy-cert-manager                cert-manager-issuers        4m1s   True    Release reconciliation succeeded
cozy-cilium                      cilium                      4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-operator               4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-providers              4m1s   True    Release reconciliation succeeded
cozy-dashboard                   dashboard                   4m1s   True    Release reconciliation succeeded
cozy-fluxcd                      cozy-fluxcd                 4m1s   True    Release reconciliation succeeded
cozy-grafana-operator            grafana-operator            4m1s   True    Release reconciliation succeeded
cozy-kamaji                      kamaji                      4m1s   True    Release reconciliation succeeded
cozy-kubeovn                     kubeovn                     4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi                4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi-operator       4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt                    4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt-operator           4m1s   True    Release reconciliation succeeded
cozy-linstor                     linstor                     4m1s   True    Release reconciliation succeeded
cozy-linstor                     piraeus-operator            4m1s   True    Release reconciliation succeeded
cozy-mariadb-operator            mariadb-operator            4m1s   True    Release reconciliation succeeded
cozy-metallb                     metallb                     4m1s   True    Release reconciliation succeeded
cozy-monitoring                  monitoring                  4m1s   True    Release reconciliation succeeded
cozy-postgres-operator           postgres-operator           4m1s   True    Release reconciliation succeeded
cozy-rabbitmq-operator           rabbitmq-operator           4m1s   True    Release reconciliation succeeded
cozy-redis-operator              redis-operator              4m1s   True    Release reconciliation succeeded
cozy-telepresence                telepresence                4m1s   True    Release reconciliation succeeded
cozy-victoria-metrics-operator   victoria-metrics-operator   4m1s   True    Release reconciliation succeeded
tenant-root                      tenant-root                 4m1s   True    Release reconciliation succeeded
Normally, all of them should be in the Ready state with the status Release reconciliation succeeded.
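To quickly spot releases that are not ready, one option is to filter on the READY column (a simple awk sketch matching the column layout above):
kubectl get hr -A | awk 'NR > 1 && $4 != "True"'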
Diagnosing the install retries exhausted error
Sometimes you may encounter the following error:
# kubectl get hr -n cozy-dashboard dashboard
NAME        AGE   READY   STATUS
dashboard   15m   False   install retries exhausted
You can try to figure out the cause by checking the events:
kubectl describe hr -n cozy-dashboard dashboard
If it shows Events: <none>, suspend and resume the release:
kubectl patch hr -n cozy-dashboard dashboard -p '{"spec": {"suspend": true}}' --type=merge
kubectl patch hr -n cozy-dashboard dashboard -p '{"spec": {"suspend": null}}' --type=merge
and check the events again:
kubectl describe hr -n cozy-dashboard dashboard
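If you have the Flux CLI installed, you can also trigger reconciliation directly instead of patching the resource:
flux reconcile helmrelease dashboard -n cozy-dashboard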
Cluster bootstrapping
Errors that can occur when bootstrapping a cluster with talos-bootstrap, talm, or talosctl:
No Talos nodes in maintenance mode found!
If you encounter issues with the talos-bootstrap script not detecting any nodes, follow these steps to diagnose and resolve the issue:
Verify Network Segment
Ensure that you are running the script within the same network segment as the nodes. This is crucial for the script to be able to communicate with the nodes.
Use Nmap to Discover Nodes
Check if nmap can discover your node by running the following command:
nmap -Pn -n -p 50000 192.168.0.0/24
This command scans for nodes in the network that are listening on port 50000.
The output should list all the nodes in the network segment that are listening on this port, indicating that they are reachable.
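If a node you expect is missing from the list, you can probe it directly; the address below is a placeholder for your node's IP:
nmap -Pn -n -p 50000 192.168.0.131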
Verify talosctl Connectivity
Next, verify that talosctl can connect to a specific node, especially if the node is in maintenance mode:
talosctl -e "${node}" -n "${node}" get machinestatus -i
Receiving an error like the following usually means your local talosctl binary is outdated:
rpc error: code = Unimplemented desc = unknown service resource.ResourceService
Updating talosctl to the latest version should resolve this issue.
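You can check which version your client is running with:
talosctl version --client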
Run talos-bootstrap in debug mode
If the previous steps don't help, run talos-bootstrap in debug mode to gain more insight. Execute the script with the -x option to enable debug mode:
bash -x talos-bootstrap
Pay attention to the last command displayed before the error; it often indicates the command that failed and can provide clues for further troubleshooting.
Kube-OVN crash
In complex cases, you may encounter issues where the Kube-OVN DaemonSet pods crash or fail to start properly. This usually indicates a corrupted OVN database. You can confirm this by checking the logs of the Kube-OVN CNI pods.
Get the list of pods in the cozy-kubeovn namespace:
# kubectl get pod -n cozy-kubeovn
NAME                 READY   STATUS    RESTARTS      AGE
kube-ovn-cni-5rsvz   0/1     Running   5 (35s ago)   4m37s
kube-ovn-cni-jq2zz   0/1     Running   5 (33s ago)   4m39s
kube-ovn-cni-p4gz2   0/1     Running   3 (23s ago)   4m38s
Read the logs of a pod by its name (kube-ovn-cni-jq2zz in this example):
# kubectl logs -n cozy-kubeovn kube-ovn-cni-jq2zz
W0725 08:21:12.479452 87678 ovs.go:35] 100.64.0.4 network not ready after 3 ping to gateway 100.64.0.1
W0725 08:21:15.479600 87678 ovs.go:35] 100.64.0.4 network not ready after 6 ping to gateway 100.64.0.1
W0725 08:21:18.479628 87678 ovs.go:35] 100.64.0.4 network not ready after 9 ping to gateway 100.64.0.1
W0725 08:21:21.479355 87678 ovs.go:35] 100.64.0.4 network not ready after 12 ping to gateway 100.64.0.1
W0725 08:21:24.479322 87678 ovs.go:35] 100.64.0.4 network not ready after 15 ping to gateway 100.64.0.1
W0725 08:21:27.479664 87678 ovs.go:35] 100.64.0.4 network not ready after 18 ping to gateway 100.64.0.1
W0725 08:21:30.478907 87678 ovs.go:35] 100.64.0.4 network not ready after 21 ping to gateway 100.64.0.1
W0725 08:21:33.479738 87678 ovs.go:35] 100.64.0.4 network not ready after 24 ping to gateway 100.64.0.1
W0725 08:21:36.479607 87678 ovs.go:35] 100.64.0.4 network not ready after 27 ping to gateway 100.64.0.1
W0725 08:21:39.479753 87678 ovs.go:35] 100.64.0.4 network not ready after 30 ping to gateway 100.64.0.1
W0725 08:21:42.479480 87678 ovs.go:35] 100.64.0.4 network not ready after 33 ping to gateway 100.64.0.1
W0725 08:21:45.478754 87678 ovs.go:35] 100.64.0.4 network not ready after 36 ping to gateway 100.64.0.1
W0725 08:21:48.479396 87678 ovs.go:35] 100.64.0.4 network not ready after 39 ping to gateway 100.64.0.1
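To skim the logs of all CNI pods at once, you can select them by label (assuming app=kube-ovn-cni, the label matching this DaemonSet's pods):
kubectl logs -n cozy-kubeovn -l app=kube-ovn-cni --tail=20 --prefix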
To resolve this issue, you can clean up the OVN database. This involves running a DaemonSet that removes the OVN configuration files from each node. It is safe to perform this cleanup — the Kube-OVN DaemonSet will automatically recreate the necessary files from the Kubernetes API.
Apply the following YAML to deploy the cleanup DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ovn-cleanup
  namespace: cozy-kubeovn
spec:
  selector:
    matchLabels:
      app: ovn-cleanup
  template:
    metadata:
      labels:
        app: ovn-cleanup
        component: network
        type: infra
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: ovn-central
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: cleanup
        image: busybox
        command: ["/bin/sh", "-xc", "rm -rf /host-config-ovn/*; rm -rf /host-config-ovn/.*; exec sleep infinity"]
        volumeMounts:
        - name: host-config-ovn
          mountPath: /host-config-ovn
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - operator: "Exists"
      volumes:
      - name: host-config-ovn
        hostPath:
          path: /var/lib/ovn
          type: ""
      hostNetwork: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 1
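Save the manifest to a file (the name ovn-cleanup.yaml below is arbitrary) and apply it:
kubectl apply -f ovn-cleanup.yaml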
Verify that the DaemonSet is running:
# kubectl get pod -n cozy-kubeovn
NAME                READY   STATUS    RESTARTS   AGE
ovn-cleanup-hjzxb   1/1     Running   0          6s
ovn-cleanup-wmzdv   1/1     Running   0          6s
ovn-cleanup-ztm86   1/1     Running   0          6s
Once the cleanup is complete, delete the cleanup DaemonSet and restart the Kube-OVN DaemonSet pods to apply the new configuration, then check the pod status:
kubectl delete ds -n cozy-kubeovn ovn-cleanup
kubectl get pod -n cozy-kubeovn
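If the CNI pods do not restart on their own, one way to force it is a rollout restart of the DaemonSet (the name kube-ovn-cni is assumed from the pod names shown earlier):
kubectl rollout restart ds/kube-ovn-cni -n cozy-kubeovn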
Remove a failed node from the cluster
When a cluster node fails, Cozystack automatically handles high availability by recreating replicated PVCs and workloads on other nodes. However, some issues require manually removing the failed node to resolve:
Local storage PVs may remain bound to the failed node, which can cause issues with new pods. These need to be cleaned up manually.
The failed node will still exist in the cluster, which can lead to inconsistencies in the cluster state and affect pod scheduling.
Step 1: Remove the Node from the Cluster
Run the following command to remove the failed node (replace mynode with the actual node name):
kubectl delete node mynode
If the failed node is a control-plane node, you must also remove its etcd member from the etcd cluster:
talm -f nodes/node1.yaml etcd member list
Example output:
NODE          ID                 HOSTNAME   PEER URLS                     CLIENT URLS                   LEARNER
37.27.60.28   2ba6e48b8cf1a0c1   node1      https://192.168.100.11:2380   https://192.168.100.11:2379   false
37.27.60.28   b82e2194fb76ee42   node2      https://192.168.100.12:2380   https://192.168.100.12:2379   false
37.27.60.28   f24f4de3d01e5e88   node3      https://192.168.100.13:2380   https://192.168.100.13:2379   false
Then remove the corresponding member (replace the ID with the one for your failed node):
talm -f nodes/node1.yaml etcd remove-member f24f4de3d01e5e88
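Afterwards, list the members again to confirm that the failed node is gone:
talm -f nodes/node1.yaml etcd member list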
Step 2: Remove PVCs and Pods Bound to the Failed Node
Here are a few commands to help you clean up after the failed node:
Delete PVCs bound to the failed node, replacing mynode with the name of your failed node (see also the note after these commands):
kubectl get pv -o json | jq -r '.items[] | select(.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == "mynode").spec.claimRef | "kubectl delete pvc -n \(.namespace) \(.name)"' | sh -x
Delete pods stuck in the Pending state across all namespaces:
kubectl get pod -A | awk '/Pending/ {print "kubectl delete pod -n " $1 " " $2}' | sh -x
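In both cases you can preview the generated commands without executing them by dropping the final | sh -x from the pipeline.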
Step 3: Check Resource Status
After cleanup, check for any resource issues using linstor advise:
# linstor advise resource
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource ┊ Issue ┊ Possible fix ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-02b0c0a1-e0b6-4e98-9384-60ff24f3b3b6 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-02b0c0a1-e0b6-4e98-9384-60ff24f3b3b6 ┊
┊ pvc-06e3b406-23f0-4f10-8b03-84063c1b2a12 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-06e3b406-23f0-4f10-8b03-84063c1b2a12 ┊
┊ pvc-a0b8aeaf-076e-4bd9-93ed-c4db09c04d0b ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-a0b8aeaf-076e-4bd9-93ed-c4db09c04d0b ┊
┊ pvc-a523ebeb-c3b6-468d-abe5-f6afbbf31081 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-a523ebeb-c3b6-468d-abe5-f6afbbf31081 ┊
┊ pvc-cf7e87b5-3e6d-4034-903d-4625830fb5b4 ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-cf7e87b5-3e6d-4034-903d-4625830fb5b4 ┊
┊ pvc-d344bc83-97fd-4489-bbe7-5399eea57165 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-d344bc83-97fd-4489-bbe7-5399eea57165 ┊
┊ pvc-d39345a9-5446-4c64-a5ba-957ff7c7a31f ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-d39345a9-5446-4c64-a5ba-957ff7c7a31f ┊
┊ pvc-db6d4236-93bd-4268-9dcc-0ed275b17067 ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-db6d4236-93bd-4268-9dcc-0ed275b17067 ┊
┊ pvc-ebb412c3-083c-4eee-93dc-70917ea6d87e ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-ebb412c3-083c-4eee-93dc-70917ea6d87e ┊
┊ pvc-f107aacb-78d7-4ac6-97f8-8ed529a9c292 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-f107aacb-78d7-4ac6-97f8-8ed529a9c292 ┊
┊ pvc-f347d71a-b646-45e5-a717-f0a745061beb ┊ Resource expected to have 1 replicas, got only 0. ┊ linstor rd ap --place-count 1 pvc-f347d71a-b646-45e5-a717-f0a745061beb ┊
┊ pvc-f6e96c83-6144-4510-b0ab-61936db52391 ┊ Resource expected to have 3 replicas, got only 2. ┊ linstor rd ap --place-count 3 pvc-f6e96c83-6144-4510-b0ab-61936db52391 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Run the linstor rd ap commands suggested in the “Possible fix” column to restore the desired replica count.
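For example, to restore three replicas for the first resource listed above:
linstor rd ap --place-count 3 pvc-02b0c0a1-e0b6-4e98-9384-60ff24f3b3b6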