Custom Resources deleted during Karmada Control Plane Upgrade #6098

mszacillo · 2025-02-07T21:36:26Z

What happened:

While upgrading our QA Karmada control plane to update deployment images and some member cluster kubeconfigs, two things happened:

The controller-manager crash-looped due to a bug, which was fixed and reverted.
Two cluster estimators failed to dial their respective member clusters because the certs in their kubeconfigs were out of date.

When the controller-manager came back up, we noticed that all custom resources (including FlinkDeployments) that had been scheduled on the two member clusters with outdated kubeconfig files were deleted from the Karmada control-plane. Below I've included relevant logs from the controller-manager for identity resource test-dev-identity.

Defaulted container "karmada-controller-manager" out of: karmada-controller-manager, wait (init)
E0205 01:35:28.635126       1 detector.go:677] Failed to get object(identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity), error: genericidentities.identity.bloomberg.com "test-dev-identity" not found
I0205 01:35:28.635233       1 detector.go:236] Reconciling object: identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity
E0205 01:35:28.637232       1 detector.go:677] Failed to get object(identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity), error: genericidentities.identity.bloomberg.com "test-dev-identity" not found
E0205 01:35:28.717906       1 execution_controller.go:311] Failed to get the resource(kind=GenericIdentity, test-namespace/test-dev-identity) from member cluster(nj-dev), err is the informer of cluster(nj-dev) has not been initialized
E0205 01:35:28.717947       1 execution_controller.go:272] Failed to create or update resource(test-namespace/test-dev-identity) in the given member cluster nj-dev, err is the informer of cluster(nj-dev) has not been initialized
I0205 01:35:28.718826       1 recorder.go:104] "Failed to create or update resource(test-namespace/test-dev-identity) in member cluster(nj-dev): the informer of cluster(nj-dev) has not been initialized" logger="events" type="Warning" object={"kind":"GenericIdentity","namespace":"test-namespace","name":"test-dev-identity","uid":"c40a52e2-3551-4472-8397-ef8e38e10eb2","apiVersion":"identity.com/v1"} reason="SyncFailed"
E0205 01:35:28.741614       1 execution_controller.go:151] Failed to sync work(karmada-es-nj-dev/test-dev-identity-5d869594f6) to cluster(nj-dev), err: the informer of cluster(nj-dev) has not been initialized
E0205 01:35:28.743763       1 controller.go:316] "Reconciler error" err="the informer of cluster(nj-dev) has not been initialized" controller="execution-controller" controllerGroup="work.karmada.io" controllerKind="Work" Work="karmada-es-nj-dev/test-dev-identity-5d869594f6" namespace="karmada-es-nj-dev" name="test-dev-identity-5d869594f6" reconcileID="5ded8599-dd47-4bb5-a1ad-480d97eae70a"
I0205 01:35:28.744849       1 recorder.go:104] "Failed to sync work(karmada-es-nj-dev/test-dev-identity-5d869594f6) to cluster(nj-dev), err: the informer of cluster(nj-dev) has not been initialized" logger="events" type="Warning" object={"kind":"Work","namespace":"karmada-es-nj-dev","name":"test-dev-identity-5d869594f6","uid":"b221550d-c9c2-41c0-9379-6b47b2cfe654","apiVersion":"work.karmada.io/v1alpha1","resourceVersion":"24978880"} reason="SyncFailed"
I0205 01:35:29.753077       1 objectwatcher.go:173] Updated the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(tt-dev).
I0205 01:35:29.753429       1 recorder.go:104] "Successfully applied resource(test-namespace/test-dev-identity) to cluster tt-dev" logger="events" type="Normal" object={"kind":"GenericIdentity","namespace":"test-namespace","name":"test-dev-identity","uid":"c40a52e2-3551-4472-8397-ef8e38e10eb2","apiVersion":"identity.com/v1"} reason="SyncSucceed"
I0205 01:35:29.753526       1 recorder.go:104] "Sync work(karmada-es-tt-dev/test-dev-identity-5d869594f6) to cluster(tt-dev) successful." logger="events" type="Normal" object={"kind":"Work","namespace":"karmada-es-tt-dev","name":"test-dev-identity-5d869594f6","uid":"63f27671-27d6-498a-83b9-2cca75273aa7","apiVersion":"work.karmada.io/v1alpha1","resourceVersion":"24933879"} reason="SyncSucceed"
I0205 01:35:31.650681       1 objectwatcher.go:229] Deleted the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(ny-dev).
  flink-validate-generic-identity: Generic Identity [test-dev-identity]
  flink-validate-generic-identity: Generic Identity [test-dev-identity]
I0205 01:35:31.728450       1 recorder.go:104] "Failed to create or update resource(test-namespace/ioi-quality) in member cluster(ny-dev): admission webhook \"validate.kyverno.svc-fail\" denied the request: \n\nresource FlinkDeployment/test-namespace/ioi-quality was blocked due to the following policies \n" logger="events" type="Warning" object={"kind":"FlinkDeployment","namespace":"test-namespace","name":"ioi-quality","uid":"d5aa0ed8-f90a-4b48-a30e-cf8215da36a0","apiVersion":"flink.apache.org/v1beta1"} reason="SyncFailed"
  flink-validate-generic-identity: Generic Identity [test-dev-identity]
	  flink-validate-generic-identity: Generic Identity [test-dev-identity]
I0205 01:35:31.738387       1 recorder.go:104] "Failed to sync work(karmada-es-ny-dev/ioi-quality-675b466bfd) to cluster(ny-dev), err: admission webhook \"validate.kyverno.svc-fail\" denied the request: \n\nresource FlinkDeployment/test-namespace/ioi-quality was blocked due to the following policies \n" logger="events" type="Warning" object={"kind":"Work","namespace":"karmada-es-ny-dev","name":"ioi-quality-675b466bfd","uid":"641b1694-2b13-48f2-8f42-7d928ce35212","apiVersion":"work.karmada.io/v1alpha1","resourceVersion":"24979161"} reason="SyncFailed"
I0205 01:35:31.751131       1 objectwatcher.go:229] Deleted the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(nj-dev).
I0205 01:35:31.767888       1 objectwatcher.go:229] Deleted the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(tt-dev).
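
The member-cluster deletions in the excerpt above can be pulled out mechanically, for example to compare their timestamps with the first "not found" error. The following is a minimal Python sketch, not part of Karmada. It assumes the klog line layout and the "Deleted the resource(...) on cluster(...)" message exactly as they appear in this excerpt, and both may differ in other versions. The names `Deletion` and `member_deletions` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple


class Deletion(NamedTuple):
    timestamp: str  # klog "MMDD HH:MM:SS.ffffff" (klog omits the year)
    kind: str
    namespace: str
    name: str
    cluster: str


# Pattern derived from the log excerpt above; treat it as an assumption.
_DELETED = re.compile(
    r"^\s*I(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"objectwatcher\.go:\d+\] "
    r"Deleted the resource\(kind=(?P<kind>[^,]+), "
    r"(?P<namespace>[^/]+)/(?P<name>[^)]+)\) "
    r"on cluster\((?P<cluster>[^)]+)\)"
)


def member_deletions(lines: Iterable[str]) -> List[Deletion]:
    """Return every member-cluster deletion found in controller-manager logs."""
    found = []
    for line in lines:
        match = _DELETED.search(line)
        if match:
            found.append(Deletion(
                timestamp=f"{match.group('date')} {match.group('time')}",
                kind=match.group("kind"),
                namespace=match.group("namespace"),
                name=match.group("name"),
                cluster=match.group("cluster"),
            ))
    return found
```

Passing an open log file (any iterable of lines) to `member_deletions` returns the deletions in log order, so the affected clusters and the time of each deletion can be read off directly.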

My understanding is that even if Karmada is unable to contact one or more clusters, the worst it should do is just deschedule work on those clusters. I was not under the impression that Karmada would delete the resources from the control-plane altogether.

Did this happen because the controller-manager crash coincided with clusters using outdated kubeconfigs, so that when it attempted to initialize informers for those clusters it was unable to do so and decided to clean up orphaned work?

Is there a way to prevent Karmada from deleting, on its own, resources that have been applied to the control plane?

What you expected to happen:

Karmada should ideally never delete user-applied resources without the user explicitly deleting the resources themselves. Is there a configuration that we could set to mitigate this?

How to reproduce it (as minimally and precisely as possible):

In our QA setup, we pushed a bad image to the controller-manager and also happened to use two kubeconfigs that had outdated certs. When the controller manager was fixed and brought back up, all resources that had been scheduled on the member clusters with bad kubeconfigs were deleted.

Anything else we need to know?:

Environment:

K8s v1.31
Karmada version: v1.12.3

RainbowMango · 2025-02-12T08:47:14Z

My understanding is that even if Karmada is unable to contact one or more clusters, the worst it should do is just deschedule work on those clusters. I was not under the impression that Karmada would delete the resources from the control plane altogether.

Yes, you are right: Karmada wouldn't and shouldn't delete users' resources. In this case, test-dev-identity was created by the user, so it can only be removed by the user.

RainbowMango · 2025-02-12T08:53:14Z

In our QA setup, we pushed a bad image to the controller-manager and also happened to use two kubeconfigs that had outdated certs.

Can you elaborate on what you did? How bad was the image? Did the controller-manager go into a crash loop?

The karmada-controller-manager accepts just one config, the Karmada one; it doesn't directly require the kubeconfig of any member cluster. So, I need more info to understand what you did.

RainbowMango · 2025-02-12T09:00:21Z

E0205 01:35:28.635126 1 detector.go:677] Failed to get object(identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity), error: genericidentities.identity.bloomberg.com "test-dev-identity" not found

This is the first log line showing that the resource was not found in either the informer cache or karmada-apiserver. Did you have an operator or something similar running at that time that could have deleted the resource unexpectedly?

mszacillo · 2025-02-13T04:38:46Z

Can you elaborate on what you did? How bad was the image? Did the controller-manager go into a crash loop?

I was doing a quick rebase of our internal fork (which includes a custom FederatedResourceQuota controller that I will discuss in the proposed design for that feature). While resolving merge conflicts I missed one of my changes, which caused the controller-manager cmd to go into CrashLoopBackOff.

Obviously in DEV we have a deployment pipeline which prevents these kinds of bad images from being promoted, but since this was QA I was less careful. Additionally, we'll soon be relying on the platform @jabellard is helping build, so these types of upgrade errors should be minimized.

Did you have an operator or something similar running at that time that could have deleted the resource unexpectedly?

None that would delete custom resources. I'll try to reproduce this and add more information.

RainbowMango · 2025-02-13T06:16:03Z

None that would delete custom resources. I'll try to reproduce this and add more information.

OK. By the way, if you have the karmada-apiserver audit log, you can find when and by whom those resources were deleted.
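
As a rough illustration of that audit-log check, here is a minimal Python sketch. It assumes the karmada-apiserver writes audit events as one JSON object per line in the standard Kubernetes `audit.k8s.io/v1` Event format, so the fields `verb`, `stage`, `user.username`, `userAgent`, `objectRef`, `responseStatus`, and `requestReceivedTimestamp` are available. The function name `find_deletions` and the plural resource name `genericidentities` in the usage note are illustrative assumptions.

```python
import json
from typing import Dict, Iterable, Iterator, Optional


def find_deletions(lines: Iterable[str], resource: str, name: str,
                   namespace: Optional[str] = None) -> Iterator[Dict[str, object]]:
    """Yield a summary of each completed delete request for one object.

    `lines` is an iterable of JSON-lines audit events.
    `resource` is the plural resource name as recorded in objectRef.
    Collection deletes (verb "deletecollection") carry no object name
    and are not matched by this sketch.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        if event.get("verb") != "delete":
            continue
        if event.get("stage") != "ResponseComplete":
            continue  # count each request once, at its final stage
        if ref.get("resource") != resource or ref.get("name") != name:
            continue
        if namespace is not None and ref.get("namespace") != namespace:
            continue
        yield {
            "when": event.get("requestReceivedTimestamp"),
            "who": (event.get("user") or {}).get("username"),
            "userAgent": event.get("userAgent"),
            "code": (event.get("responseStatus") or {}).get("code"),
        }
```

Because an open file is already an iterable of lines, it can be passed in directly, for example `find_deletions(open(path), "genericidentities", "test-dev-identity", "test-namespace")`, where `path` points to wherever the audit log is stored in your deployment. The `who` and `userAgent` fields of each hit indicate which client issued the delete.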

mszacillo added the kind/bug label Feb 7, 2025

github-project-automation bot added this to Karmada Overall Backlog Feb 7, 2025
