Better control of database read replicas #2874

Closed
dabeeeenster opened this issue Oct 23, 2023 · 2 comments
Labels
api Issue related to the REST API improvement Improvement to the existing platform

Comments

@dabeeeenster
Contributor

  • Add the ability to specify read database regions for failover or performance.
  • I want to be able to set the order that Flagsmith uses for reads. Right now the code picks a replica at random, which isn’t super helpful.
@dabeeeenster dabeeeenster added improvement Improvement to the existing platform api Issue related to the REST API labels Nov 3, 2023
@zachaysan
Contributor

I did a review online of common approaches and here are some of my findings:

  1. Cross-region database reads are rare. Most of the documentation focuses on local replicas.
  2. There are two main approaches for failover from one replica to another. The more prevalent one uses a heartbeat connection to the replica in question, checked either right before handing off to Django or stored locally in a cache that's refreshed once a second. If the replica is still online, it is handed to the reader; otherwise the next available replica is handed off instead (or even the default database if there is none). The secondary approach uses middleware to intercept database queries; it looked inferior to the first solution.
  3. Falling back from the primary database to a promoted read replica is not widely covered by online sources. I think it should be possible, but we would be venturing into the unknown, with risks like the WAL not being replicated between the primary and the secondary databases.
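
The cached heartbeat from finding 2 could look roughly like the sketch below. It is plain Python rather than Flagsmith's code: `HeartbeatCache`, `first_live_replica`, and the `probe` callable are hypothetical names, and the one-second TTL only mirrors the refresh interval mentioned above. In a Django deployment the probe would presumably run a trivial query against the replica's connection, which is not shown here.

```python
import time
from typing import Callable, Dict, Sequence, Tuple


class HeartbeatCache:
    """Remembers the last liveness probe per replica for a short TTL."""

    def __init__(
        self,
        probe: Callable[[str], bool],  # hypothetical: returns True if the replica answers
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        # alias -> (time of last probe, result of last probe)
        self._results: Dict[str, Tuple[float, bool]] = {}

    def is_alive(self, alias: str) -> bool:
        now = self._clock()
        cached = self._results.get(alias)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        try:
            alive = bool(self._probe(alias))
        except Exception:
            # Any probe error (timeout, refused connection) counts as offline.
            alive = False
        self._results[alias] = (now, alive)
        return alive


def first_live_replica(
    replicas: Sequence[str], default: str, heartbeat: HeartbeatCache
) -> str:
    """Return the first replica that is online, else the default database."""
    for alias in replicas:
        if heartbeat.is_alive(alias):
            return alias
    return default
```

Injecting the clock and the probe keeps the sketch testable without a database; the trade-off of caching is that a replica that dies is still handed out until its cached result expires.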

Given what I've read online so far, I think we can pretty easily handle this ticket the following way:

  1. Implement a heartbeat connection to our replicas using a Django cache to avoid querying every cycle.
  2. Create a secondary set of CROSS_REGION_REPLICA_DATABASE_URLS in settings.py. These would not be used unless the local regional replication databases have fallen over.
  3. Keep the current REPLICA_DATABASE_URLS in settings.py which are the first line of querying for reads.
  4. To support ordered reads instead of randomly distributed ones, I suggest a new settings.py variable called REPLICA_READ_STRATEGY. It would be set to either DISTRIBUTED, which is the current approach of spreading reads across replicas, or SEQUENTIAL, which would try the REPLICA_DATABASE_URLS in order and, once they're all exhausted, fall back to CROSS_REGION_REPLICA_DATABASE_URLS and try those in order.
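
A minimal sketch of how the two strategies in step 4 might select a database. The function name `choose_read_database`, its parameters, and the `"default"` alias are assumptions for illustration, and the sketch also assumes that DISTRIBUTED falls back to the cross-region pool once every local replica is down, which the proposal implies but does not spell out. The `is_alive` callable stands in for the heartbeat check from step 1.

```python
import random
from typing import Callable, Sequence

DISTRIBUTED = "DISTRIBUTED"
SEQUENTIAL = "SEQUENTIAL"


def choose_read_database(
    strategy: str,
    local_replicas: Sequence[str],
    cross_region_replicas: Sequence[str],
    is_alive: Callable[[str], bool],
    default: str = "default",
    choose: Callable[[Sequence[str]], str] = random.choice,
) -> str:
    """Pick a database alias for a read under the given strategy."""
    if strategy not in (DISTRIBUTED, SEQUENTIAL):
        raise ValueError(f"Unknown REPLICA_READ_STRATEGY: {strategy!r}")
    # Local replicas are always preferred; cross-region ones are a fallback.
    for pool in (local_replicas, cross_region_replicas):
        if strategy == SEQUENTIAL:
            # Try replicas strictly in the configured order.
            for alias in pool:
                if is_alive(alias):
                    return alias
        else:
            # Spread reads across whichever replicas in this pool are online.
            live = [alias for alias in pool if is_alive(alias)]
            if live:
                return choose(live)
    # Nothing is online in either pool, so read from the default database.
    return default
```

In Django this logic would most likely sit behind a database router's `db_for_read` method, with the alias lists built from the two settings.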

I'm not sure if we should complicate it more than that, but one downside of this approach is that if the REPLICA_DATABASE_URLS have suffered losses to the point where only a single replica remains, the load on it may be high even though the CROSS_REGION_REPLICA_DATABASE_URLS replicas are on standby. We could consider introducing another variable that specifies a minimum distributed replica pool size, which could mix the two pools if necessary, but I doubt this strategy is really necessary.
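
For reference, the minimum-pool idea could be sketched as below. Both `build_read_pool` and its `minimum_pool_size` parameter are hypothetical; the thread does not name such a setting, and this is only one way the two pools might be mixed.

```python
from typing import Callable, List, Sequence


def build_read_pool(
    local_replicas: Sequence[str],
    cross_region_replicas: Sequence[str],
    is_alive: Callable[[str], bool],
    minimum_pool_size: int,
) -> List[str]:
    """Top up the live local replicas with cross-region ones if too few remain."""
    pool = [alias for alias in local_replicas if is_alive(alias)]
    for alias in cross_region_replicas:
        if len(pool) >= minimum_pool_size:
            break
        if is_alive(alias):
            pool.append(alias)
    return pool
```

With a minimum of 2 and a single surviving local replica, the pool would gain one live cross-region replica, so distributed reads are split between them instead of all landing on the survivor.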

@zachaysan
Contributor

Solved in #3300
