Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

message: account for duplicates across a recovered checkpoint boundary #406

Merged
merged 1 commit into from
Oct 17, 2024

Conversation

jgraettinger
Copy link
Contributor

@jgraettinger jgraettinger commented Oct 17, 2024

Sequencer's runtime assertion checks were slightly too strong. A scenario we encountered which tripped a runtime panic was:

  • CONTINUE_TXN messages were read from a journal, and were then part of a checkpoint. The checkpoint included the producer's earlier last ACK clock and a begin offset at the first of these messages.

  • The messages were then duplicated within the journal, and ACKed.

  • A new Sequencer recovers from the checkpoint.

  • It reads the later, duplicated messages and adds them to the ring, followed by an ACK which begins a dequeue with a replay-read of the earlier messages.

  • The earlier messages are dequeued first, and then the replay ends and hands off to the ring.

  • Sequencer blows up because the first ring message is not strictly larger than the largest Clock dequeued from the replay read.

The fix is simply to remove the runtime assertion, and discard duplicates, as this is valid thing that can happen. Also add a new test case covering this scenario.


This change is Reviewable

Sequencer's runtime assertion checks were slightly too strong.
A scenario we encountered which tripped a runtime panic was:

- CONTINUE_TXN messages were read from a journal, and were then part of
  a checkpoint. The checkpoint included the producer's earlier last ACK
  clock and a begin offset at the first of these messages.

- The messages were then duplicated within the journal, and ACKed.

- A new Sequencer recovers from the checkpoint.

- It reads the later, duplicated messages and adds them to the ring,
  followed by an ACK which begins a dequeue with a replay-read of the
  earlier messages.

- The earlier messages are dequeued first, and then the replay ends and
  hands off to the ring.

- Sequencer blows up because the first ring message is not strictly
  larger than the largest Clock dequeued from the replay read.

The fix is simply to remove the runtime assertion, and discard
duplicates, as this is valid thing that can happen. Also add a new test
case covering this scenario.
@jgraettinger jgraettinger requested a review from psFried October 17, 2024 03:12
Copy link
Contributor

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jgraettinger jgraettinger merged commit 719adef into master Oct 17, 2024
1 check passed
@jgraettinger jgraettinger deleted the johnny/dups-across-checkpoints branch October 17, 2024 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants