Cozystack Troubleshooting Guide

Showcase various ways to get more information out of Flux controllers to debug potential problems.

This guide shows the initial steps to check your cluster’s health and discover problems. In the bottom of the page you will find links to troubleshooting guides for various Cozystack components and aspects of cluster operations.

Getting basic information

You can see the logs of main installer by executing:

kubectl logs -n cozy-system deploy/cozystack -f

All the platform components are installed using Flux CD HelmReleases.

You can get all installed HelmReleases:

# kubectl get hr -A
NAMESPACE                        NAME                        AGE    READY   STATUS
cozy-cert-manager                cert-manager                4m1s   True    Release reconciliation succeeded
cozy-cert-manager                cert-manager-issuers        4m1s   True    Release reconciliation succeeded
cozy-cilium                      cilium                      4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-operator               4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-providers              4m1s   True    Release reconciliation succeeded
cozy-dashboard                   dashboard                   4m1s   True    Release reconciliation succeeded
cozy-fluxcd                      cozy-fluxcd                 4m1s   True    Release reconciliation succeeded
cozy-grafana-operator            grafana-operator            4m1s   True    Release reconciliation succeeded
cozy-kamaji                      kamaji                      4m1s   True    Release reconciliation succeeded
cozy-kubeovn                     kubeovn                     4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi                4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi-operator       4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt                    4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt-operator           4m1s   True    Release reconciliation succeeded
cozy-linstor                     linstor                     4m1s   True    Release reconciliation succeeded
cozy-linstor                     piraeus-operator            4m1s   True    Release reconciliation succeeded
cozy-mariadb-operator            mariadb-operator            4m1s   True    Release reconciliation succeeded
cozy-metallb                     metallb                     4m1s   True    Release reconciliation succeeded
cozy-monitoring                  monitoring                  4m1s   True    Release reconciliation succeeded
cozy-postgres-operator           postgres-operator           4m1s   True    Release reconciliation succeeded
cozy-rabbitmq-operator           rabbitmq-operator           4m1s   True    Release reconciliation succeeded
cozy-redis-operator              redis-operator              4m1s   True    Release reconciliation succeeded
cozy-telepresence                telepresence                4m1s   True    Release reconciliation succeeded
cozy-victoria-metrics-operator   victoria-metrics-operator   4m1s   True    Release reconciliation succeeded
tenant-root                      tenant-root                 4m1s   True    Release reconciliation succeeded

Normaly all of them should be Ready and Release reconciliation succeeded

Troubleshooting Flux CD

Diagnosing install retries exhausted error

Sometimes you can face with the error:

# kubectl get hr -A -n cozy-dashboard dashboard
NAMESPACE          NAME          AGE   READY   STATUS
cozy-dashboard     dashboard     15m   False   install retries exhausted

You can try to figure out by checking events:

kubectl describe hr -n cozy-dashboard dashboard

if Events: <none> then suspend and reume the release:

kubectl patch hr -n cozy-dashboard dashboard -p '{"spec": {"suspend": true}}' --type=merge
kubectl patch hr -n cozy-dashboard dashboard -p '{"spec": {"suspend": null}}' --type=merge

and check the events again:

kubectl describe hr -n cozy-dashboard dashboard

Kube-OVN crash

In complex cases, you may encounter issues where the Kube-OVN DaemonSet pods crash or fail to start properly. This usually indicates a corrupted OVN database. You can confirm this by checking the logs of the Kube-OVN CNI pods.

Get the list of pods in cozy-kubeovn namespace:

# kubectl get pod -n cozy-kubeovn
NAME                                   READY   STATUS              RESTARTS       AGE
kube-ovn-cni-5rsvz                     0/1     Running             5 (35s ago)    4m37s
kube-ovn-cni-jq2zz                     0/1     Running             5 (33s ago)    4m39s
kube-ovn-cni-p4gz2                     0/1     Running             3 (23s ago)    4m38s

Read the logs of a pod by its name (kube-ovn-cni-jq2zz in this example):

# kubectl logs -n cozy-kubeovn kube-ovn-cni-jq2zz
W0725 08:21:12.479452   87678 ovs.go:35] 100.64.0.4 network not ready after 3 ping to gateway 100.64.0.1
W0725 08:21:15.479600   87678 ovs.go:35] 100.64.0.4 network not ready after 6 ping to gateway 100.64.0.1
W0725 08:21:18.479628   87678 ovs.go:35] 100.64.0.4 network not ready after 9 ping to gateway 100.64.0.1
W0725 08:21:21.479355   87678 ovs.go:35] 100.64.0.4 network not ready after 12 ping to gateway 100.64.0.1
W0725 08:21:24.479322   87678 ovs.go:35] 100.64.0.4 network not ready after 15 ping to gateway 100.64.0.1
W0725 08:21:27.479664   87678 ovs.go:35] 100.64.0.4 network not ready after 18 ping to gateway 100.64.0.1
W0725 08:21:30.478907   87678 ovs.go:35] 100.64.0.4 network not ready after 21 ping to gateway 100.64.0.1
W0725 08:21:33.479738   87678 ovs.go:35] 100.64.0.4 network not ready after 24 ping to gateway 100.64.0.1
W0725 08:21:36.479607   87678 ovs.go:35] 100.64.0.4 network not ready after 27 ping to gateway 100.64.0.1
W0725 08:21:39.479753   87678 ovs.go:35] 100.64.0.4 network not ready after 30 ping to gateway 100.64.0.1
W0725 08:21:42.479480   87678 ovs.go:35] 100.64.0.4 network not ready after 33 ping to gateway 100.64.0.1
W0725 08:21:45.478754   87678 ovs.go:35] 100.64.0.4 network not ready after 36 ping to gateway 100.64.0.1
W0725 08:21:48.479396   87678 ovs.go:35] 100.64.0.4 network not ready after 39 ping to gateway 100.64.0.1

To resolve this issue, you can clean up the OVN database. This involves running a DaemonSet that removes the OVN configuration files from each node. It is safe to perform this cleanup — the Kube-OVN DaemonSet will automatically recreate the necessary files from the Kubernetes API.

Apply the following YAML to deploy the cleanup DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ovn-cleanup
  namespace: cozy-kubeovn
spec:
  selector:
    matchLabels:
      app: ovn-cleanup
  template:
    metadata:
      labels:
        app: ovn-cleanup
        component: network
        type: infra
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: ovn-central
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: cleanup
        image: busybox
        command: ["/bin/sh", "-xc", "rm -rf /host-config-ovn/*; rm -rf /host-config-ovn/.*; exec sleep infinity"]
        volumeMounts:
        - name: host-config-ovn
          mountPath: /host-config-ovn
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - operator: "Exists"
      volumes:
      - name: host-config-ovn
        hostPath:
          path: /var/lib/ovn
          type: ""
      hostNetwork: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 1

Verify that the DaemonSet is running:

# kubectl get pod -n cozy-kubeovn
ovn-cleanup-hjzxb                      1/1     Running             0              6s
ovn-cleanup-wmzdv                      1/1     Running             0              6s
ovn-cleanup-ztm86                      1/1     Running             0              6s

Once the cleanup is complete, delete the DaemonSet and restart the Kube-OVN DaemonSet pods to apply the new configuration:

kubectl delete ds
kubectl get pod -n cozy-kubeovn

Specific Troubleshooting Guides

Cluster Bootstrapping

See the Kubernetes installation troubleshooting.

Cluster Maintenance

Remove a failed node from the cluster

See the Cluster Maintenance > Cluster Scaling.