Add flow for re-sending replay frames that spectator did not receive at end of play #38163

Draft
bdach wants to merge 1 commit into ppy:master from bdach:resend-missing-replay-frames

@bdach bdach commented Jun 26, 2026

Collaborator

RFC. Lightly tested full-stack with 2 PCs to simulate real drop-offs, but probably still needs some further testing.

Compatibility matrix:

|            | old client | new client |
| :--------: | :--------: | :--------: |
| old server |   🟠[^1]   |   🟠[^1]   |
| new server |   🟠[^2]   |   🟢[^3]   |

This implements the proposal first outlined in ppy/osu-server-spectator#244 (comment).

With this change, if a player's connection drops out during a play but then recovers before the end of the play, the server will re-request any frame bundles that it has not received thus far.
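
As a rough illustration of how the server could work out what to re-request: if the client reports the sequence number of its final bundle (the footnotes call this `LastFrameBundleSequenceNumber`), the missing set is whatever the server has not seen up to that number. The sketch below is in Python for brevity rather than the project's C#, and it assumes sequence numbers start at 0 and increase by 1 per bundle, which this description does not state. The function name is hypothetical.

```python
def missing_sequence_numbers(received, last_sequence_number):
    """Return, in ascending order, the sequence numbers the server never saw.

    Assumption (not confirmed by the PR): sequence numbers start at 0 and
    increase by exactly 1 per frame bundle.
    """
    seen = set(received)
    return [n for n in range(last_sequence_number + 1) if n not in seen]


# Bundles 3 and 4 were lost mid-play, 7 and 8 were lost at the very end.
# Without the final sequence number the trailing loss would go unnoticed.
print(missing_sequence_numbers([0, 1, 2, 5, 6], last_sequence_number=8))
# [3, 4, 7, 8]
```

An empty result corresponds to the case where nothing needs to be re-sent.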

The client caches every frame bundle it sends until the play has ended and the server has requested any missing frame bundles. The frame bundles for past plays are purged only when the server invokes CompleteReplay(). (If the server determines that it has received all frame bundles, it will still invoke CompleteReplay(), but with an empty list of bundle sequence numbers.)
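
A minimal sketch of that client-side cache, written in Python for brevity rather than the project's C#. It only mirrors the behaviour described above: bundles are kept per score token and dropped when the server's `CompleteReplay()` call arrives, whether or not anything was missing. The class and method names are hypothetical and are not the names used in this PR.

```python
from collections import defaultdict


class FrameBundleCache:
    """Keeps every sent bundle per score token until the server completes the replay."""

    def __init__(self):
        # score_token_id -> {sequence_number: bundle}
        self._bundles = defaultdict(dict)

    def record_sent(self, score_token_id, sequence_number, bundle):
        self._bundles[score_token_id][sequence_number] = bundle

    def complete_replay(self, score_token_id, missing_sequence_numbers):
        """Purge the play from the cache and return the bundles the server asked for."""
        bundles = self._bundles.pop(score_token_id, {})
        return [bundles[n] for n in missing_sequence_numbers if n in bundles]


cache = FrameBundleCache()
for seq in range(3):
    cache.record_sent(score_token_id=42, sequence_number=seq, bundle=f"bundle-{seq}")

print(cache.complete_replay(42, [1]))  # ['bundle-1']
print(cache.complete_replay(42, [1]))  # [] (the play was purged by the first call)
```

Note that nothing here removes a play whose `CompleteReplay()` call never arrives, which is the part of the strategy open to counterproposals.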

I expect this caching strategy to be controversial, so I am open to counterproposals (LRU / limited capacity queue? timed expiry? something else?)

Footnotes

[^1]: Operations will succeed, but the server never attempts to retrieve any dropped frames because it is not aware of the new flow, so the resulting replay will be incomplete.

[^2]: Operations will succeed, but the server will not attempt to use the new flow to retrieve any dropped frames, because the client will not send LastFrameBundleSequenceNumber, so the resulting replay will be incomplete.

[^3]: Operations will succeed. The server will attempt to retrieve dropped frames in a once-off operation at the end of gameplay.

@bdach bdach requested a review from a team June 26, 2026 12:22
@bdach bdach self-assigned this Jun 26, 2026
@bdach bdach added the area:online functionality label Jun 26, 2026
@bdach bdach added the area:replay and type/reliability labels Jun 26, 2026
@bdach bdach moved this from Inbox to Pending Review in osu! team task tracker Jun 29, 2026
@peppy peppy self-requested a review June 30, 2026 09:01

@peppy peppy left a comment

Member

a few concerns

/// <param name="FrameBundleSequenceNumbers">The sequence numbers of frame bundles that the server never received.</param>
[Serializable]
[MessagePackObject]
public record CompleteReplayRequest(

Member

One thing I got in the habit of doing with bancho/stable is prefixing these kinds of calls with Client or Server, because I guarantee we are going to end up with Request objects being fired in both directions sooner or later.

Thoughts?

ServerCompleteReplayRequest
ClientCompleteReplayResponse

Collaborator Author

Other than the fact that no other operation between client and spectator server uses this convention, I do not have any particular opinions on this proposal.

Member

Until now we haven't had any classes with Request / Response pairing, so I'm not sure it would suit there. It matters specifically when Request/Response is involved IMHO, because until now the client has always been the requester whenever these terms are used (i.e. web requests).

But we could potentially prefix other classes where relevant.

/// Used to determine ordering of frame bundles, and for server-side checks that server received all frame bundles it was supposed to.
/// </summary>
[Key(2)]
public long? SequenceNumber { get; set; }

Member

We already have ReceivedTime in FrameHeader, which ended up never being used, a bit weird.

Anyway, any reason this is in the bundle and not the header? I'm not sure what the differentiation we have as to what goes in either, but curious if you have reasoning for putting it here specifically.

Collaborator Author

> We already have ReceivedTime in FrameHeader, which ended up never being used, a bit weird.

No idea what this is, as best I can tell it's completely dead.

> Anyway, any reason this is in the bundle and not the header? I'm not sure what the differentiation we have as to what goes in either, but curious if you have reasoning for putting it here specifically.

There is no reason. Putting it on the bundle makes the server side a touch less painful because it removes one level of property scraping but that's not a real reason.

private long currentFrameBundleSequenceNumber;

private readonly Queue<FrameDataBundle> pendingFrameBundles = new Queue<FrameDataBundle>();
private readonly Dictionary<long, List<FrameDataBundle>> allFrameBundles = new Dictionary<long, List<FrameDataBundle>>();

Member

Wonder how big this will grow during a multi-hour marathon beatmap 🤔. Might be worth calculating on paper.

Collaborator Author

Messages can be up to 32 KB in size by default. Not yet sure how this translates to actual usage, will investigate.

@bdach bdach Jun 30, 2026

Collaborator Author

Back-of-napkin estimates are not great.

frameBundle: 459 B
maxMessageSize: 32 * 1024 B
bundlesPerSecond: 5

maxMessageSize / (frameBundle * bundlesPerSecond) ≈ 14.28 [seconds]

In other words, a single message at the default size limit holds only about 14 seconds' worth of frame bundles. Open to suggestions on what to do about it.

316 of those 459 bytes are the actual replay frames, so even dropping the frame headers does not help much here.
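
For anyone who wants to re-run the numbers, here is the same estimate as a short Python script (Python only for convenience). The three inputs are the figures quoted above; the per-minute and per-hour totals are extra derived values that bear on how large the client-side cache gets over a long play. The constant names are illustrative.

```python
FRAME_BUNDLE_BYTES = 459            # measured size of one frame bundle (from above)
MAX_MESSAGE_BYTES = 32 * 1024       # default maximum message size
BUNDLES_PER_SECOND = 5              # send rate quoted above

bytes_per_second = FRAME_BUNDLE_BYTES * BUNDLES_PER_SECOND
seconds_per_message = MAX_MESSAGE_BYTES / bytes_per_second
bytes_per_minute = bytes_per_second * 60
bytes_per_hour = bytes_per_minute * 60

print(f"gameplay per max-size message: {seconds_per_message:.2f} s")        # 14.28 s
print(f"cached data per minute: {bytes_per_minute / 1024:.1f} KiB")          # 134.5 KiB
print(f"cached data per hour: {bytes_per_hour / (1024 * 1024):.2f} MiB")     # 7.88 MiB
```

These figures ignore any serialisation or transport overhead on top of the raw bundle size.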

Member

I'd chunk the responses client side to a sane number, for sure.

But also I wasn't even thinking about max message size (so good you were I guess?). Was more about client side memory usage for very long maps. But roughly 128 KB per minute does not sound too bad.

Collaborator Author

> I'd chunk the responses client side to a sane number, for sure.

I don't understand what this means. Please elaborate.

Half of the appeal of this solution was that this was a single-shot request that doesn't need retrying. If suddenly I have to split this single-shot request into however many, each of which can fail (what happens when any one of them does?), this is going to get very complex very quickly.

I'd sooner entertain solutions like having the client send the .osr across or similar. Maybe that at least can fit into a reasonably small size.


public Task<CompleteReplayResponse> CompleteReplay(CompleteReplayRequest completeReplayRequest)
{
    if (!allFrameBundles.Remove(completeReplayRequest.ScoreTokenId, out var frameBundlesForScoreToken))

@peppy peppy Jun 30, 2026

Member

You touched on this in the OP, but I fear that this is not enough in terms of cleanup logic.

If a user reconnects after a long time being disconnected and the server has forgotten about the user's pending replay data, it's going to remain in the client dictionary until restart, correct?

Collaborator Author

> If a user reconnects after a long time being disconnected and the server has forgotten about the user's pending replay data, it's going to remain in the client dictionary until restart, correct?

Correct. I have no opinion on how to handle that as choosing an expiry mechanism is highly subjective.
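
Purely to make one of the options from the opening post concrete, here is a sketch of the "limited capacity" variant in Python (not the project's C#). It is not what this PR implements. It caps the number of plays held on the client and evicts the oldest play once the cap is exceeded, so a play the server has forgotten about cannot linger until restart. The class name, method names, and the `max_plays` parameter are all hypothetical.

```python
from collections import OrderedDict


class BoundedPlayCache:
    """Holds frame bundles for at most `max_plays` plays, evicting the oldest first."""

    def __init__(self, max_plays):
        if max_plays < 1:
            raise ValueError("max_plays must be at least 1")
        self._max_plays = max_plays
        self._plays = OrderedDict()  # score_token_id -> list of bundles, oldest play first

    def record_sent(self, score_token_id, bundle):
        if score_token_id not in self._plays:
            self._plays[score_token_id] = []
            # Evict the oldest plays until the cap is respected again.
            while len(self._plays) > self._max_plays:
                self._plays.popitem(last=False)
        self._plays[score_token_id].append(bundle)

    def take(self, score_token_id):
        """Remove and return the bundles for a play, or [] if it was evicted or unknown."""
        return self._plays.pop(score_token_id, [])


cache = BoundedPlayCache(max_plays=2)
cache.record_sent(1, "a")
cache.record_sent(2, "b")
cache.record_sent(3, "c")   # play 1 is evicted here
print(cache.take(1))        # []
print(cache.take(3))        # ['c']
```

The trade-off is that an evicted play can no longer be repaired, so the cap would have to be chosen with that in mind.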

bdach commented Jul 1, 2026

Collaborator Author

To summarise the conversation that happened elsewhere regarding this PR today:

  • The flow added here is server-initiated because the server is the one with reliable information as to which frame bundles it did not receive. A client-initiated flow here does not do much, unless the possibility of sending redundant data, or worse, not sending data that should be sent, is accepted.

  • The server cannot completely know which frame bundles it did not receive until the client indicates to the server that it is done with a score. While in the middle of the play you could have the client just reconnect and send new frames, at which point the server would notice a gap in the sequence numbers, this does not work if the client drops out at the end of a play, because the server does not know where the play ends.

  • To that end this flow was deemed mostly acceptable, except for the 32 KB message size limit. The solution to that will be to split the one-shot frame re-send into multiple re-sends. For now each of those re-sends will also be one-shot. This means that the server might still not receive a complete replay. The failure rate of the re-sends will be tracked, and if this becomes a problem, this will be iterated upon further.

  • However, before I try any of that, there's a larger problem that was previously described in Reintroduce score submission retry mechanism #24609 (comment), namely that EndPlaySession() is serializing for a single client. Because the server only keeps state for one play, currently the client would not be able to start a new play before it has attempted to re-send any and all dropped frames to the server.

    This would make things much worse for the many users who already complain about the serial nature of submission. In locations like Russia or China bad network links are routine, and users already have issues wherein they need to wait several seconds for the previous play to fully submit before they can play again.

    To that end, I want to try removing the serial requirements of EndPlaySession() and allowing storing multiple scores for one user server-side. This will not be unlimited, to avoid DDoS-style attacks; a single user will only be allowed a single-digit number of scores pending server-side (exact number TBD).

    Doing this also paves the way for potentially implementing real submission retry in the future - the only major remaining piece would be to split off API score submission requests to a separate queue so that they don't block the main one.
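
To illustrate the splitting mentioned in the third point, here is a hedged Python sketch (not the project's C#) that greedily groups bundles into messages that each stay under the size limit. It counts only the bundle bytes and ignores envelope and serialisation overhead, so a real limit would need some headroom. The function name is hypothetical, and nothing here handles a failed re-send, in line with each re-send being one-shot for now.

```python
def split_into_messages(bundle_sizes, max_message_bytes=32 * 1024):
    """Group bundle indices into consecutive messages that each fit the size limit.

    Assumption: a message's size is simply the sum of its bundle sizes.
    """
    messages, current, current_size = [], [], 0
    for index, size in enumerate(bundle_sizes):
        if size > max_message_bytes:
            raise ValueError(f"bundle {index} alone exceeds the message size limit")
        # Start a new message when adding this bundle would overflow the current one.
        if current and current_size + size > max_message_bytes:
            messages.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        messages.append(current)
    return messages


# 200 dropped bundles of 459 B each, using the size estimated earlier in the thread.
messages = split_into_messages([459] * 200)
print([len(m) for m in messages])  # [71, 71, 58]
```

With 459-byte bundles, 71 fit in one 32 KB message, so the number of re-sends grows by one for roughly every 14 seconds of dropped gameplay.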

Blocking for now due to all of the above. @peppy please confirm that I haven't misrepresented anything.

Labels

  • area:online functionality (Deals with online fetching / sending but don't change much on a surface UI level.)
  • area:replay
  • blocked/don't merge (Don't merge this.)
  • size/L
  • type/reliability (Deals with game crashing or breaking in a serious way.)

Projects

Status: Backburner
