Cozystack Troubleshooting Guide
This guide covers the initial steps for checking your cluster's health and discovering problems. At the bottom of the page you will find links to troubleshooting guides for specific Cozystack components and aspects of cluster operations.
Getting basic information
You can view the logs of the main installer by executing:
kubectl logs -n cozy-system deploy/cozystack -f
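If this command returns no output or the deployment cannot be found, first make sure the installer itself is up, assuming it runs as the cozystack deployment in the cozy-system namespace as referenced above:
kubectl get deploy -n cozy-system cozystack
kubectl get pod -n cozy-system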
All the platform components are installed using Flux CD HelmReleases.
You can list all installed HelmReleases:
# kubectl get hr -A
NAMESPACE                        NAME                         AGE    READY   STATUS
cozy-cert-manager                cert-manager                 4m1s   True    Release reconciliation succeeded
cozy-cert-manager                cert-manager-issuers         4m1s   True    Release reconciliation succeeded
cozy-cilium                      cilium                       4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-operator                4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-providers               4m1s   True    Release reconciliation succeeded
cozy-dashboard                   dashboard                    4m1s   True    Release reconciliation succeeded
cozy-fluxcd                      cozy-fluxcd                  4m1s   True    Release reconciliation succeeded
cozy-grafana-operator            grafana-operator             4m1s   True    Release reconciliation succeeded
cozy-kamaji                      kamaji                       4m1s   True    Release reconciliation succeeded
cozy-kubeovn                     kubeovn                      4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi                 4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi-operator        4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt                     4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt-operator            4m1s   True    Release reconciliation succeeded
cozy-linstor                     linstor                      4m1s   True    Release reconciliation succeeded
cozy-linstor                     piraeus-operator             4m1s   True    Release reconciliation succeeded
cozy-mariadb-operator            mariadb-operator             4m1s   True    Release reconciliation succeeded
cozy-metallb                     metallb                      4m1s   True    Release reconciliation succeeded
cozy-monitoring                  monitoring                   4m1s   True    Release reconciliation succeeded
cozy-postgres-operator           postgres-operator            4m1s   True    Release reconciliation succeeded
cozy-rabbitmq-operator           rabbitmq-operator            4m1s   True    Release reconciliation succeeded
cozy-redis-operator              redis-operator               4m1s   True    Release reconciliation succeeded
cozy-telepresence                telepresence                 4m1s   True    Release reconciliation succeeded
cozy-victoria-metrics-operator   victoria-metrics-operator    4m1s   True    Release reconciliation succeeded
tenant-root                      tenant-root                  4m1s   True    Release reconciliation succeeded
Normally, all of them should be Ready with the status Release reconciliation succeeded.
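On a busy cluster it can be convenient to show only the releases that have not reconciled yet. One simple way, filtering on the status text shown above:
kubectl get hr -A | grep -v 'Release reconciliation succeeded'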
Troubleshooting Flux CD
Diagnosing the "install retries exhausted" error
Sometimes you may encounter the following error:
# kubectl get hr -n cozy-dashboard dashboard
NAME        AGE   READY   STATUS
dashboard   15m   False   install retries exhausted
You can try to figure out the cause by checking the events:
kubectl describe hr -n cozy-dashboard dashboard
If the output shows Events: <none>, suspend and resume the release:
kubectl patch hr -n cozy-dashboard dashboard -p '{"spec": {"suspend": true}}' --type=merge
kubectl patch hr -n cozy-dashboard dashboard -p '{"spec": {"suspend": null}}' --type=merge
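If you have the Flux CLI installed on your workstation, the same suspend/resume cycle can also be performed with it:
flux suspend helmrelease dashboard -n cozy-dashboard
flux resume helmrelease dashboard -n cozy-dashboard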
Then check the events again:
kubectl describe hr -n cozy-dashboard dashboard
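If the events still do not explain the failure, the Flux helm-controller logs usually contain the underlying Helm error. Assuming the Flux controllers run in the cozy-fluxcd namespace (where the cozy-fluxcd HelmRelease above is installed) and the controller deployment is named helm-controller, you can check them with:
kubectl logs -n cozy-fluxcd deploy/helm-controller | grep dashboard
If the deployment has a different name in your installation, list the deployments with kubectl get deploy -n cozy-fluxcd first.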
Kube-OVN crash
In complex cases, you may encounter issues where the Kube-OVN DaemonSet pods crash or fail to start properly. This usually indicates a corrupted OVN database. You can confirm this by checking the logs of the Kube-OVN CNI pods.
Get the list of pods in the cozy-kubeovn namespace:
# kubectl get pod -n cozy-kubeovn
NAME                 READY   STATUS    RESTARTS      AGE
kube-ovn-cni-5rsvz   0/1     Running   5 (35s ago)   4m37s
kube-ovn-cni-jq2zz   0/1     Running   5 (33s ago)   4m39s
kube-ovn-cni-p4gz2   0/1     Running   3 (23s ago)   4m38s
Read the logs of a pod by its name (kube-ovn-cni-jq2zz in this example):
# kubectl logs -n cozy-kubeovn kube-ovn-cni-jq2zz
W0725 08:21:12.479452 87678 ovs.go:35] 100.64.0.4 network not ready after 3 ping to gateway 100.64.0.1
W0725 08:21:15.479600 87678 ovs.go:35] 100.64.0.4 network not ready after 6 ping to gateway 100.64.0.1
W0725 08:21:18.479628 87678 ovs.go:35] 100.64.0.4 network not ready after 9 ping to gateway 100.64.0.1
W0725 08:21:21.479355 87678 ovs.go:35] 100.64.0.4 network not ready after 12 ping to gateway 100.64.0.1
W0725 08:21:24.479322 87678 ovs.go:35] 100.64.0.4 network not ready after 15 ping to gateway 100.64.0.1
W0725 08:21:27.479664 87678 ovs.go:35] 100.64.0.4 network not ready after 18 ping to gateway 100.64.0.1
W0725 08:21:30.478907 87678 ovs.go:35] 100.64.0.4 network not ready after 21 ping to gateway 100.64.0.1
W0725 08:21:33.479738 87678 ovs.go:35] 100.64.0.4 network not ready after 24 ping to gateway 100.64.0.1
W0725 08:21:36.479607 87678 ovs.go:35] 100.64.0.4 network not ready after 27 ping to gateway 100.64.0.1
W0725 08:21:39.479753 87678 ovs.go:35] 100.64.0.4 network not ready after 30 ping to gateway 100.64.0.1
W0725 08:21:42.479480 87678 ovs.go:35] 100.64.0.4 network not ready after 33 ping to gateway 100.64.0.1
W0725 08:21:45.478754 87678 ovs.go:35] 100.64.0.4 network not ready after 36 ping to gateway 100.64.0.1
W0725 08:21:48.479396 87678 ovs.go:35] 100.64.0.4 network not ready after 39 ping to gateway 100.64.0.1
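A corrupted OVN database usually also shows up in the ovn-central pods, which host the database itself. As an additional check, you can inspect them using the app=ovn-central label (the same label referenced by the cleanup DaemonSet's pod affinity below):
kubectl get pod -n cozy-kubeovn -l app=ovn-central
kubectl logs -n cozy-kubeovn -l app=ovn-central --tail=50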
To resolve this issue, you can clean up the OVN database. This involves running a DaemonSet that removes the OVN configuration files from each node. It is safe to perform this cleanup — the Kube-OVN DaemonSet will automatically recreate the necessary files from the Kubernetes API.
Apply the following YAML to deploy the cleanup DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ovn-cleanup
  namespace: cozy-kubeovn
spec:
  selector:
    matchLabels:
      app: ovn-cleanup
  template:
    metadata:
      labels:
        app: ovn-cleanup
        component: network
        type: infra
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: ovn-central
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: cleanup
        image: busybox
        command: ["/bin/sh", "-xc", "rm -rf /host-config-ovn/*; rm -rf /host-config-ovn/.*; exec sleep infinity"]
        volumeMounts:
        - name: host-config-ovn
          mountPath: /host-config-ovn
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - operator: "Exists"
      volumes:
      - name: host-config-ovn
        hostPath:
          path: /var/lib/ovn
          type: ""
      hostNetwork: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 1
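For example, assuming you saved the manifest above to a file named ovn-cleanup.yaml (the file name is arbitrary), apply it with:
kubectl apply -f ovn-cleanup.yaml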
Verify that the DaemonSet is running:
# kubectl get pod -n cozy-kubeovn
ovn-cleanup-hjzxb   1/1   Running   0   6s
ovn-cleanup-wmzdv   1/1   Running   0   6s
ovn-cleanup-ztm86   1/1   Running   0   6s
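To confirm that the cleanup actually removed the database files, you can list the mounted directory inside one of the cleanup pods (use any pod name from the output above; /host-config-ovn is the hostPath mount defined in the manifest). The output should be empty:
kubectl exec -n cozy-kubeovn ovn-cleanup-hjzxb -- ls -A /host-config-ovn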
Once the cleanup is complete, delete the cleanup DaemonSet and restart the Kube-OVN DaemonSet pods to apply the new configuration. First, remove the cleanup DaemonSet and list the remaining pods:
kubectl delete ds -n cozy-kubeovn ovn-cleanup
kubectl get pod -n cozy-kubeovn
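One way to restart the Kube-OVN CNI pods and wait for them to become Ready again, assuming the DaemonSet is named kube-ovn-cni as the pod names in the earlier output suggest:
kubectl rollout restart ds/kube-ovn-cni -n cozy-kubeovn
kubectl rollout status ds/kube-ovn-cni -n cozy-kubeovn
If the pods were already crash-looping, they may also recover on their own once the corrupted database files have been removed.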
Specific Troubleshooting Guides
Cluster Bootstrapping
See the Kubernetes installation troubleshooting guide.
Cluster Maintenance
Remove a failed node from the cluster
See Cluster Maintenance > Cluster Scaling.