
Task processor deadlock issues #2709

Closed · matthewelwell opened this issue Aug 30, 2023 · 4 comments
Labels: bug (Something isn't working)
@matthewelwell (Contributor) commented:
Recently we have seen instances of deadlock errors causing the task processor queue to grow beyond normal levels.

It seems as though this is caused by a flood of requests for environments within the same project.

Next steps:

  1. Attempt to reproduce this by running the task processor locally
  2. If (1) fails, attempt to reproduce in the staging environment
  3. Investigate adding rate limits to all dashboard endpoints, scoped per token (see the sketch after this list)
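
For step 3, one way to scope rate limits per token would be a Django REST Framework throttle keyed on the request's auth token. This is only a minimal sketch under that assumption; the class name, scope and rate are illustrative, not anything that exists in the codebase:

```python
from rest_framework.throttling import SimpleRateThrottle


class PerTokenRateThrottle(SimpleRateThrottle):
    # Hypothetical scope; the rate would come from settings, e.g.
    # REST_FRAMEWORK = {"DEFAULT_THROTTLE_RATES": {"dashboard": "100/min"}}
    scope = "dashboard"

    def get_cache_key(self, request, view):
        # Key the throttle on the authenticated token rather than the user,
        # so each API token gets its own bucket (assumption for illustration).
        if request.auth is None:
            return None  # unauthenticated requests fall through to other throttles
        return self.cache_format % {"scope": self.scope, "ident": str(request.auth)}
```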
@dabeeeenster added the bug (Something isn't working) and api (Issue related to the REST API) labels on Sep 5, 2023
@gagantrivedi added the task-processor label and removed the api (Issue related to the REST API) label on Sep 15, 2023
@gagantrivedi (Member) commented:

Why do we get deadlocks?
Whenever this function is called for an audit log at the project level, it generates SQL that looks something like this:

UPDATE "environments_environment" SET "updated_at" = '2023-09-15T09:46:24.508997+00:00'::timestamptz WHERE ("environments_environment"."deleted_at" IS NULL AND "environments_environment"."project_id" = <>)

Now, let's assume the project has two environments, 1 and 2. In order to update the updated_at timestamp, the query needs to lock both rows one by one, which is fine so far. The problem arises when another transaction attempts to do the same concurrently.

For example, the first transaction locks the first row, and at the same time the second transaction locks the second row. Now, to complete the update, the first transaction needs a lock on the second row. However, it can't acquire that lock because the second transaction won't release it until it can lock the first row and commit. Voilà, we have a deadlock!
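
One common way to avoid this lock-ordering deadlock (a sketch only, and not necessarily what the eventual fix does) is to take the row locks in a deterministic order before issuing the bulk UPDATE, so two concurrent transactions can never hold the locks in opposite orders:

```python
from django.db import transaction
from django.utils import timezone


def touch_project_environments(project_id: int) -> None:
    # Sketch of a deadlock-avoidance pattern: lock rows in primary-key order
    # first, then update them. Model/field names are assumptions.
    with transaction.atomic():
        locked_ids = list(
            Environment.objects.filter(project_id=project_id, deleted_at__isnull=True)
            .select_for_update()
            .order_by("id")
            .values_list("id", flat=True)
        )
        Environment.objects.filter(id__in=locked_ids).update(updated_at=timezone.now())
```

The trade-off is that every caller must acquire locks in the same order for this to help, and holding the locks still serialises concurrent updates to the same project; it removes the deadlock, not the contention.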

gagantrivedi added a commit that referenced this issue on Sep 27, 2023: reduce transaction length to make #2709 more rare and improve wait time
@matthewelwell (Contributor, Author) commented:

Following Gagan's changes, we still saw a spike in deadlocks on 8th October at ~23:15 UTC. I have added a CloudWatch alarm. @gagantrivedi to review the task processor ECS logs.

@gagantrivedi (Member) commented:

We only have logs up to 14th October. We will have to wait for the DB to run into this again.

@gagantrivedi (Member) commented:

Fixed by #3339
