Troubleshooting Kube-OVN

Explains how to resolve Kube-OVN crashes caused by a corrupted OVN database.

In complex cases, the Kube-OVN DaemonSet pods may crash or fail to start. This usually indicates a corrupted OVN database, which you can confirm by checking the logs of the Kube-OVN CNI pods.

Get the list of pods in the cozy-kubeovn namespace:

# kubectl get pod -n cozy-kubeovn
NAME                                   READY   STATUS              RESTARTS       AGE
kube-ovn-cni-5rsvz                     0/1     Running             5 (35s ago)    4m37s
kube-ovn-cni-jq2zz                     0/1     Running             5 (33s ago)    4m39s
kube-ovn-cni-p4gz2                     0/1     Running             3 (23s ago)    4m38s

Read the logs of a pod by its name (kube-ovn-cni-jq2zz in this example):

# kubectl logs -n cozy-kubeovn kube-ovn-cni-jq2zz
W0725 08:21:12.479452   87678 ovs.go:35] 100.64.0.4 network not ready after 3 ping to gateway 100.64.0.1
W0725 08:21:15.479600   87678 ovs.go:35] 100.64.0.4 network not ready after 6 ping to gateway 100.64.0.1
W0725 08:21:18.479628   87678 ovs.go:35] 100.64.0.4 network not ready after 9 ping to gateway 100.64.0.1
W0725 08:21:21.479355   87678 ovs.go:35] 100.64.0.4 network not ready after 12 ping to gateway 100.64.0.1
W0725 08:21:24.479322   87678 ovs.go:35] 100.64.0.4 network not ready after 15 ping to gateway 100.64.0.1
W0725 08:21:27.479664   87678 ovs.go:35] 100.64.0.4 network not ready after 18 ping to gateway 100.64.0.1
W0725 08:21:30.478907   87678 ovs.go:35] 100.64.0.4 network not ready after 21 ping to gateway 100.64.0.1
W0725 08:21:33.479738   87678 ovs.go:35] 100.64.0.4 network not ready after 24 ping to gateway 100.64.0.1
W0725 08:21:36.479607   87678 ovs.go:35] 100.64.0.4 network not ready after 27 ping to gateway 100.64.0.1
W0725 08:21:39.479753   87678 ovs.go:35] 100.64.0.4 network not ready after 30 ping to gateway 100.64.0.1
W0725 08:21:42.479480   87678 ovs.go:35] 100.64.0.4 network not ready after 33 ping to gateway 100.64.0.1
W0725 08:21:45.478754   87678 ovs.go:35] 100.64.0.4 network not ready after 36 ping to gateway 100.64.0.1
W0725 08:21:48.479396   87678 ovs.go:35] 100.64.0.4 network not ready after 39 ping to gateway 100.64.0.1

To resolve this issue, you can clean up the OVN database. This involves running a DaemonSet that removes the OVN configuration files from each node. The cleanup is safe to perform: the Kube-OVN DaemonSet will automatically recreate the necessary files from the Kubernetes API.

Apply the following YAML to deploy the cleanup DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ovn-cleanup
  namespace: cozy-kubeovn
spec:
  selector:
    matchLabels:
      app: ovn-cleanup
  template:
    metadata:
      labels:
        app: ovn-cleanup
        component: network
        type: infra
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: ovn-central
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: cleanup
        image: busybox
        command: ["/bin/sh", "-xc", "rm -rf /host-config-ovn/* /host-config-ovn/.[!.]* /host-config-ovn/..?*; exec sleep infinity"]
        volumeMounts:
        - name: host-config-ovn
          mountPath: /host-config-ovn
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - operator: "Exists"
      volumes:
      - name: host-config-ovn
        hostPath:
          path: /var/lib/ovn
          type: ""
      hostNetwork: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 1
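To deploy it, save the manifest above to a file and apply it. The filename ovn-cleanup.yaml below is only an example:

```shell
# Save the DaemonSet manifest above as ovn-cleanup.yaml, then apply it
kubectl apply -f ovn-cleanup.yaml
```

Alternatively, you can pipe the manifest directly into kubectl apply -f - without creating a file.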

Verify that the DaemonSet is running:

# kubectl get pod -n cozy-kubeovn
NAME                                   READY   STATUS              RESTARTS       AGE
ovn-cleanup-hjzxb                      1/1     Running             0              6s
ovn-cleanup-wmzdv                      1/1     Running             0              6s
ovn-cleanup-ztm86                      1/1     Running             0              6s
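Because the cleanup container runs the shell with the -x flag, the executed rm commands are echoed to the pod log. You can optionally confirm the cleanup ran by reading the logs of one of the cleanup pods (the pod name here is taken from the listing above):

```shell
# Inspect one cleanup pod to confirm the rm commands were executed
kubectl -n cozy-kubeovn logs ovn-cleanup-hjzxb
```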

Once the cleanup is complete, delete the ovn-cleanup DaemonSet and restart the Kube-OVN CNI pods to apply the new configuration:

# Delete the cleanup DaemonSet
kubectl -n cozy-kubeovn delete ds ovn-cleanup

# Restart the Kube-OVN CNI pods via a rolling restart of the DaemonSet
kubectl -n cozy-kubeovn rollout restart ds kube-ovn-cni
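After the restart, wait for the CNI pods to become Ready again. rollout status blocks until the DaemonSet rollout has completed:

```shell
# Wait for the restarted DaemonSet to become fully available
kubectl -n cozy-kubeovn rollout status ds kube-ovn-cni

# Confirm the pods now report 1/1 READY with no new restarts
kubectl -n cozy-kubeovn get pod
```

If the kube-ovn-cni pods keep crashing after the cleanup, re-check their logs for errors other than the "network not ready" warnings shown above.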