FIX: prevent hanging on server shutdown#546
Conversation
| for _, room in list(self.rooms.items()): | ||
| for client in list(room.clients): | ||
| tasks.extend(client._background_tasks) # type: ignore[attr-defined] | ||
| client._message_queue.put_nowait(b"") # type: ignore[attr-defined] |
There was a problem hiding this comment.
might be nice to have teh client have a disconnect method that just handles this. looks rather mysterious from here.
|
hmm even though this works more than before it still doesn't work all of the time. :'( draft for now |
eb0b575 to
d0ad3a4
Compare
On SIGTERM, non-daemon anyio WorkerThread instances blocked Python's threading._shutdown() because the root asyncio task was never awaited, so its done-callbacks (which send shutdown signals to worker threads) never fired. Fix: - Track the root asyncio task (server.start()) so clean() can await it - Stop all rooms explicitly during shutdown (auto_clean_rooms=False means pycrdt never stops them automatically) - Await the root task with a timeout so anyio's task group __aexit__ runs and worker threads get their shutdown signal - Cancel timed-out tasks in stop_extension() instead of ignoring them Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Found the final issue with the help of claude and the main serve task was never being awaited so there were some workthreads hanging around preventing shutdown. updated description incoming. but i think this works now! |
|
not sure who to ask for a review here. but I can confirm that this fixes it for me locally. Not really sure why the windows 3.14t test failed :/ it doesn't seem obviously related to these changes? |
davidbrochart
left a comment
There was a problem hiding this comment.
This is why structured concurrency is helpful. Jupyverse's RTC implementation is much simpler and doesn't have this bug.
| task = asyncio.create_task(self._websocket_server.start()) | ||
| self._background_tasks.add(task) | ||
| task.add_done_callback(self._background_tasks.discard) | ||
| self._websocket_server._start_task = task |
There was a problem hiding this comment.
You could have self.create_task() return the task:
| task = asyncio.create_task(self._websocket_server.start()) | |
| self._background_tasks.add(task) | |
| task.add_done_callback(self._background_tasks.discard) | |
| self._websocket_server._start_task = task | |
| self._websocket_server._start_task = self.create_task(self._websocket_server.start()) |
|
|
||
| # Now disconnect existing clients so their serve() tasks complete. | ||
| tasks: list[asyncio.Task] = [] | ||
| for _, room in list(self.rooms.items()): |
There was a problem hiding this comment.
| for _, room in list(self.rooms.items()): | |
| for room in list(self.rooms.values()): |
| # Rooms persist after client disconnect so users can reconnect without | ||
| # a full re-sync (see auto_clean_rooms=False in app.py). On shutdown | ||
| # we must stop them explicitly. | ||
| for _, room in list(self.rooms.items()): |
There was a problem hiding this comment.
| for _, room in list(self.rooms.items()): | |
| for room in list(self.rooms.values()): |
|
Thank you for the review! I cleaned up based on your suggestions. Two follow ups:
|
|
|
|
@ianhi do you think this is worth merging and releasing or is (2) a blocker? |
fixes #161 supersedes #514
🤖 full disclosure this a PR that I used claude for nearly in its entirety. I have reviewed every line to the best of my ability and confirmed that the fix works for me locally. I did multiple local rounds of review, tightening up exceptions, logic and improving comments. But ultimately I learned more about async through this PR than I knew at the start, so very possible that there is a subtlety i missed.
To debug I ended up designing a loop for claude where it used jupyter-collab-mcp to interact with the notebooks (creating rooms crucial for causing hte bug) and then sent sigterms from a shell with
PYTHONFAULTHANDLER=1and looped through possible fixes before getting it working. I then guiding claude through coming up with a minimal fix.There were three issues preventing shutdown on receiving ctrl-C
1 client serve loop never finishing
the server was stopped. but client rooms were never stopped. We needed to sent a stop message to the clients to cause their loops to stop
jupyter-collaboration/projects/jupyter-server-ydoc/jupyter_server_ydoc/websocketserver.py
Lines 65 to 75 in c750ddd
2 Rooms never being stopped
rooms were persisted after a client disconnect (I think for allowing reconnect). But on shutdown they were never being stopped, so the task was staying alive. now we explcitly stop all rooms.
This option means they weren't being autoclosed on disconnect (which I think is good)
jupyter-collaboration/projects/jupyter-server-ydoc/jupyter_server_ydoc/app.py
Lines 119 to 120 in c750ddd
3 the root task was never awaited
because we didn't hold on to a reference to the server start task there was no way to await it and ultimately trigger the worker threads were never cleaned up. now we save a reference and await it at the end of
cleanjupyter-collaboration/projects/jupyter-server-ydoc/jupyter_server_ydoc/handlers.py
Lines 82 to 84 in c750ddd