Custom Resources deleted during Karmada Control Plane Upgrade #6098

Open
mszacillo opened this issue Feb 7, 2025 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mszacillo
Contributor

What happened:

While upgrading our QA Karmada control plane to update deployment images and some member cluster kubeconfigs, two things happened:

  1. The controller-manager crashlooped due to a bug; this was fixed and reverted.
  2. Two cluster estimators failed to dial their respective member clusters due to the certs on the kubeconfig being out of date.

When the controller-manager came back up, we noticed that all custom resources (including FlinkDeployments) that had been scheduled on the two member clusters with outdated kubeconfig files were deleted from the Karmada control-plane. Below I've included relevant logs from the controller-manager for identity resource test-dev-identity.

Defaulted container "karmada-controller-manager" out of: karmada-controller-manager, wait (init)
E0205 01:35:28.635126       1 detector.go:677] Failed to get object(identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity), error: genericidentities.identity.bloomberg.com "test-dev-identity" not found
I0205 01:35:28.635233       1 detector.go:236] Reconciling object: identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity
E0205 01:35:28.637232       1 detector.go:677] Failed to get object(identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity), error: genericidentities.identity.bloomberg.com "test-dev-identity" not found
E0205 01:35:28.717906       1 execution_controller.go:311] Failed to get the resource(kind=GenericIdentity, test-namespace/test-dev-identity) from member cluster(nj-dev), err is the informer of cluster(nj-dev) has not been initialized
E0205 01:35:28.717947       1 execution_controller.go:272] Failed to create or update resource(test-namespace/test-dev-identity) in the given member cluster nj-dev, err is the informer of cluster(nj-dev) has not been initialized
I0205 01:35:28.718826       1 recorder.go:104] "Failed to create or update resource(test-namespace/test-dev-identity) in member cluster(nj-dev): the informer of cluster(nj-dev) has not been initialized" logger="events" type="Warning" object={"kind":"GenericIdentity","namespace":"test-namespace","name":"test-dev-identity","uid":"c40a52e2-3551-4472-8397-ef8e38e10eb2","apiVersion":"identity.com/v1"} reason="SyncFailed"
E0205 01:35:28.741614       1 execution_controller.go:151] Failed to sync work(karmada-es-nj-dev/test-dev-identity-5d869594f6) to cluster(nj-dev), err: the informer of cluster(nj-dev) has not been initialized
E0205 01:35:28.743763       1 controller.go:316] "Reconciler error" err="the informer of cluster(nj-dev) has not been initialized" controller="execution-controller" controllerGroup="work.karmada.io" controllerKind="Work" Work="karmada-es-nj-dev/test-dev-identity-5d869594f6" namespace="karmada-es-nj-dev" name="test-dev-identity-5d869594f6" reconcileID="5ded8599-dd47-4bb5-a1ad-480d97eae70a"
I0205 01:35:28.744849       1 recorder.go:104] "Failed to sync work(karmada-es-nj-dev/test-dev-identity-5d869594f6) to cluster(nj-dev), err: the informer of cluster(nj-dev) has not been initialized" logger="events" type="Warning" object={"kind":"Work","namespace":"karmada-es-nj-dev","name":"test-dev-identity-5d869594f6","uid":"b221550d-c9c2-41c0-9379-6b47b2cfe654","apiVersion":"work.karmada.io/v1alpha1","resourceVersion":"24978880"} reason="SyncFailed"
I0205 01:35:29.753077       1 objectwatcher.go:173] Updated the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(tt-dev).
I0205 01:35:29.753429       1 recorder.go:104] "Successfully applied resource(test-namespace/test-dev-identity) to cluster tt-dev" logger="events" type="Normal" object={"kind":"GenericIdentity","namespace":"test-namespace","name":"test-dev-identity","uid":"c40a52e2-3551-4472-8397-ef8e38e10eb2","apiVersion":"identity.com/v1"} reason="SyncSucceed"
I0205 01:35:29.753526       1 recorder.go:104] "Sync work(karmada-es-tt-dev/test-dev-identity-5d869594f6) to cluster(tt-dev) successful." logger="events" type="Normal" object={"kind":"Work","namespace":"karmada-es-tt-dev","name":"test-dev-identity-5d869594f6","uid":"63f27671-27d6-498a-83b9-2cca75273aa7","apiVersion":"work.karmada.io/v1alpha1","resourceVersion":"24933879"} reason="SyncSucceed"
I0205 01:35:31.650681       1 objectwatcher.go:229] Deleted the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(ny-dev).
  flink-validate-generic-identity: Generic Identity [test-dev-identity]
  flink-validate-generic-identity: Generic Identity [test-dev-identity]
I0205 01:35:31.728450       1 recorder.go:104] "Failed to create or update resource(test-namespace/ioi-quality) in member cluster(ny-dev): admission webhook \"validate.kyverno.svc-fail\" denied the request: \n\nresource FlinkDeployment/test-namespace/ioi-quality was blocked due to the following policies \n" logger="events" type="Warning" object={"kind":"FlinkDeployment","namespace":"test-namespace","name":"ioi-quality","uid":"d5aa0ed8-f90a-4b48-a30e-cf8215da36a0","apiVersion":"flink.apache.org/v1beta1"} reason="SyncFailed"
  flink-validate-generic-identity: Generic Identity [test-dev-identity]
	  flink-validate-generic-identity: Generic Identity [test-dev-identity]
I0205 01:35:31.738387       1 recorder.go:104] "Failed to sync work(karmada-es-ny-dev/ioi-quality-675b466bfd) to cluster(ny-dev), err: admission webhook \"validate.kyverno.svc-fail\" denied the request: \n\nresource FlinkDeployment/test-namespace/ioi-quality was blocked due to the following policies \n" logger="events" type="Warning" object={"kind":"Work","namespace":"karmada-es-ny-dev","name":"ioi-quality-675b466bfd","uid":"641b1694-2b13-48f2-8f42-7d928ce35212","apiVersion":"work.karmada.io/v1alpha1","resourceVersion":"24979161"} reason="SyncFailed"
I0205 01:35:31.751131       1 objectwatcher.go:229] Deleted the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(nj-dev).
I0205 01:35:31.767888       1 objectwatcher.go:229] Deleted the resource(kind=GenericIdentity, test-namespace/test-dev-identity) on cluster(tt-dev).
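
To see the delete events at a glance, the log above can be filtered with a small script. This is a minimal sketch and not part of Karmada: the function name `delete_events` and both regular expressions are assumptions derived only from the klog lines shown here, so other versions may word the message differently.

```python
import re

# klog header: severity letter + MMDD, time, PID, file:line], then the message.
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)

# Message shape copied from the objectwatcher.go lines above (an assumption,
# not a documented format).
DELETE_RE = re.compile(
    r"Deleted the resource\(kind=(?P<kind>[^,]+), (?P<key>[^)]+)\) "
    r"on cluster\((?P<cluster>[^)]+)\)"
)


def delete_events(lines):
    """Yield (time, kind, namespace/name, cluster) for each delete log line."""
    for line in lines:
        header = KLOG_RE.match(line.strip())
        if header is None:
            continue  # continuation of a multi-line message, or not klog output
        hit = DELETE_RE.search(header["msg"])
        if hit:
            yield header["time"], hit["kind"], hit["key"], hit["cluster"]
```

Feeding it the controller-manager output line by line (for example `delete_events(sys.stdin)` with the pod logs piped in) lists each deletion with its timestamp and target cluster.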

My understanding is that even if Karmada is unable to contact one or more clusters, the worst it should do is just deschedule work on those clusters. I was not under the impression that Karmada would delete the resources from the control-plane altogether.

Did this happen because the controller-manager crashed while those clusters were using outdated kubeconfigs? That is, when it attempted to initialize informers for the respective clusters, was it unable to do so and therefore decided to clean up orphaned work?

Is there a way to prevent Karmada from deleting, on its own, resources that have been applied to the control-plane?

What you expected to happen:

Karmada should ideally never delete user-applied resources without the user explicitly deleting the resources themselves. Is there a configuration that we could set to mitigate this?

How to reproduce it (as minimally and precisely as possible):

In our QA setup, we pushed a bad image to the controller-manager and also happened to use two kubeconfigs that had outdated certs. When the controller manager was fixed and brought back up, all resources that had been scheduled on the member clusters with bad kubeconfigs were deleted.

Anything else we need to know?:

Environment:

  • K8s v1.31
  • Karmada version: v1.12.3
@mszacillo added the kind/bug label Feb 7, 2025
@RainbowMango
Member

My understanding is that even if Karmada is unable to contact one or more clusters, the worst it should do is just deschedule work on those clusters. I was not under the impression that Karmada would delete the resources from the control plane altogether.

Yes, you are right: Karmada wouldn't and shouldn't delete users' resources. In this case, test-dev-identity was created by the user, so it can only be removed by users.

@RainbowMango
Member

In our QA setup, we pushed a bad image to the controller-manager and also happened to use two kubeconfigs that had outdated certs.

Can you elaborate on what you did? How bad is the image? Did the controller-manager go into a crash loop?

The karmada-controller-manager accepts just one config, that of Karmada; it doesn't directly require the kubeconfig of any member cluster. So I need more info to understand what you did.

@RainbowMango
Member

E0205 01:35:28.635126 1 detector.go:677] Failed to get object(identity.com/v1, kind=GenericIdentity, test-namespace/test-dev-identity), error: genericidentities.identity.bloomberg.com "test-dev-identity" not found

This is the first log line showing that the resource was not found in either the informer cache or karmada-apiserver. Did you have an operator or something similar running at that time that could have deleted the resource unexpectedly?

@mszacillo
Contributor Author

Can you elaborate on what you did? How bad is the image? Did the controller-manager go into a crash loop?

I was doing a quick rebase of our internal fork (which includes a custom FederatedResourceQuota controller, which I will discuss in the proposed design for that feature). While resolving merge conflicts, I missed one of my changes, which caused the controller-manager cmd to go into CrashLoopBackOff.

Obviously in DEV we have a deployment pipeline which prevents these kinds of bad images from being promoted, but since this was QA I was less careful. Additionally, we'll soon be relying on the platform @jabellard is helping build, so these types of upgrade errors should be minimized.

Did you have an operator or something similar running at that time that could have deleted the resource unexpectedly?

None that would delete custom resources. I'll try to reproduce this and add more information.

@RainbowMango
Member

None that would delete custom resources. I'll try to reproduce this and add more information.

OK. By the way, if you have the karmada-apiserver audit log, you can find when and by whom those resources were deleted.
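
As a rough illustration of that lookup, the sketch below scans an audit log for delete requests against one object. It assumes the karmada-apiserver writes audit events in the JSON-lines format of the Kubernetes audit log backend (one `audit.k8s.io/v1` event per line); the function name `find_deletes` is hypothetical, and the resource name `genericidentities` is taken from the error message in the log above.

```python
import json


def find_deletes(audit_lines, resource, namespace, name):
    """Return (timestamp, username, userAgent) for completed delete requests.

    Assumes one JSON-encoded audit event per line. A deletecollection
    request carries no object name, so it is matched on resource and
    namespace only.
    """
    hits = []
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event
        if event.get("stage") != "ResponseComplete":
            continue  # avoid counting the same request at several stages
        if event.get("verb") not in ("delete", "deletecollection"):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") != resource or ref.get("namespace") != namespace:
            continue
        if ref.get("name") not in (name, None, ""):
            continue
        hits.append((
            event.get("requestReceivedTimestamp"),
            (event.get("user") or {}).get("username"),
            event.get("userAgent"),
        ))
    return hits
```

Calling `find_deletes(open(path), "genericidentities", "test-namespace", "test-dev-identity")`, where `path` is wherever the audit log is stored, returns who issued each delete and when, which should show whether the request came from a Karmada component or from something else.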


2 participants