fix(cua-driver): let start_session revive an idle-reaped session (recover from "session ended")#2035
fix(cua-driver): let start_session revive an idle-reaped session (recover from "session ended")#2035LaZzyMan wants to merge 1 commit into
Conversation
A caller-declared session reclaimed by the idle-TTL sweep (the agent went quiet >5min while reasoning / writing code) was permanently dead: ENDED_SESSIONS is never cleared, touch_session no-ops on an ended id, and the daemon's resurrection guard rejected every subsequent call — including start_session itself, which routes through the same "call" arm. Once a run idled out, every MCP tool returned "session ended; tool call ignored" with no way back. - session::revive_session: clear the ended tombstone + re-arm the idle-TTL clock so an explicit start_session resumes the run (a superset of touch_session: no-op-equivalent for a live/new id). - start_session tool: call revive_session instead of touch_session. - serve.rs: exempt start_session from the resurrection guard in both dispatch arms so it actually runs (it is the recovery path); other tools stay guarded. - the guard message now tells the agent how to recover. revive is unit-tested (revive_resumes_an_ended_session); the sweep-using session tests are serialized so a cross-test zero-TTL reap can't re-end a freshly revived id. cargo test -p cua-driver-core green; cargo check -p cua-driver green.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
@LaZzyMan is attempting to deploy a commit to the Cua Team on Vercel. A member of the Team first needs to authorize it. |
📝 WalkthroughWalkthroughThe PR adds ChangesSession revival flow
Sequence Diagram(s)sequenceDiagram
participant Client
participant call_dispatch as "serve.rs call dispatch"
participant start_tool as "StartSessionTool::invoke"
participant revive_session as "session::revive_session"
participant ENDED_SESSIONS
participant SESSION_ACTIVITY
Client->>call_dispatch: call(tool_name, effective_session=session_id)
alt tool_name == "start_session"
call_dispatch->>start_tool: invoke(session_id)
start_tool->>revive_session: revive_session(session_id)
revive_session->>ENDED_SESSIONS: remove tombstone
revive_session->>SESSION_ACTIVITY: store fresh timestamp
start_tool-->>call_dispatch: session resumes
call_dispatch-->>Client: ok
else tool_name != "start_session" and session ended
call_dispatch-->>Client: ok + call start_session with same session
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@libs/cua-driver/rust/crates/cua-driver-core/src/session.rs`:
- Around line 138-152: revive_session currently clears the session-layer
tombstone but leaves the overlay-layer tombstone in place, so revived sessions
still get blocked by the cursor renderer. Update revive_session in session.rs to
also clear the overlay state for the same session_id, ideally by introducing or
invoking a dedicated revive hook that removes the key from RenderState::ended in
platform-macos/src/overlay.rs. Make sure the fix aligns with the existing
end-session cleanup path so revived sessions can start rendering cursors again
without changing the trackable-session guard behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a03f76d9-2fd4-452c-a837-f2286fe6501b
📒 Files selected for processing (3)
libs/cua-driver/rust/crates/cua-driver-core/src/session.rslibs/cua-driver/rust/crates/cua-driver-core/src/session_tools.rslibs/cua-driver/rust/crates/cua-driver/src/serve.rs
| pub fn revive_session(session_id: &str) { | ||
| if !is_trackable(session_id) { | ||
| return; | ||
| } | ||
| // Clear the permanent tombstone so `is_session_ended` reads false again and | ||
| // the daemon's resurrection guard stops rejecting calls for this id. A | ||
| // later `end_session` re-fires the cleanup hooks (fire_session_end is keyed | ||
| // on this set), which is correct — it's a fresh lifecycle for the id. | ||
| ended_sessions().lock().unwrap().remove(session_id); | ||
| // Re-arm the idle-TTL clock so the revived session is tracked again. | ||
| activity() | ||
| .lock() | ||
| .unwrap() | ||
| .insert(session_id.to_owned(), Instant::now()); | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1. Find revive_session and any hooks it fires.
rg -nP -C3 '\bfn revive_session\b' libs/cua-driver/rust/crates
# 2. Show how the overlay `ended` tombstone is mutated — is it ever removed from?
fd -t f overlay.rs libs/cua-driver/rust/crates/platform-macos \
--exec rg -nP -C3 '\bended\b'
# 3. Look for any revive/resume hook registration that could clear overlay state.
rg -nP -C2 'revive|resume_session|register_session_(begin|revive)_hook' \
libs/cua-driver/rust/cratesRepository: trycua/cua
Length of output: 10236
revive_session clears the session tombstone but not the overlay-layer cursor tombstone — revived sessions will fail to render cursors.
The cursor overlay (platform-macos/src/overlay.rs) maintains a separate ended: HashSet<CursorKey> that permanently blocks command processing for tombstoned keys. As seen in lines 143–145 and 255–257 of overlay.rs, any Cmd or seed_start for a key in this set is silently dropped to prevent "ghost-cursor resurrection."
revive_session in session.rs (lines 138–152) correctly clears the session-layer tombstone (ended_sessions) but fails to clear the overlay-layer tombstone. Consequently:
- The daemon accepts tool calls for the revived session.
- The overlay guard continues to drop all cursor commands for that
session_id. - The cursor never reappears, violating the "RESUMES the session" contract in
session_tools.rs(line 79).
Add a call to clear the overlay tombstone (e.g., via a fire_session_revive hook that removes the session_id from RenderState::ended) to ensure the render layer respects the session lifecycle reset.
Relevant code sections
session.rs:
pub fn revive_session(session_id: &str) {
if !is_trackable(session_id) {
return;
}
// ...
ended_sessions().lock().unwrap().remove(session_id); // Clears session tombstone
// MISSING: Clear overlay tombstone here
activity()
.lock()
.unwrap()
.insert(session_id.to_owned(), Instant::now());
}overlay.rs (guard logic):
// Line 143-145
if map.ended.contains(&key) {
return None; // Drops command silently
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/cua-driver-core/src/session.rs` around lines 138
- 152, revive_session currently clears the session-layer tombstone but leaves
the overlay-layer tombstone in place, so revived sessions still get blocked by
the cursor renderer. Update revive_session in session.rs to also clear the
overlay state for the same session_id, ideally by introducing or invoking a
dedicated revive hook that removes the key from RenderState::ended in
platform-macos/src/overlay.rs. Make sure the fix aligns with the existing
end-session cleanup path so revived sessions can start rendering cursors again
without changing the trackable-session guard behavior.
Ports the fix from upstream trycua/cua#2035 into the vendored driver. When a session is reaped for idleness, a subsequent start_session with the same id failed instead of resuming it. Revive the ended session in place so the agent can continue rather than getting a hard error.
…coordinates (QwenLM#5896) * feat(cua-driver): vendor trycua/cua driver with 1000-normalized coordinate support Vendor libs/cua-driver from trycua/cua into packages/cua-driver as the basis for qwen-code's computer-use backend, adding an opt-in relative (1000x1000 normalized) coordinate mode for Qwen-VL clients. - coord_norm.rs: 0-1000 <-> pixel conversion, per-(pid,window_id) size cache, tools/list description rewrite (TDD, 27 tests) - ToolRegistry: normalized field + invoke input/output hooks - protocol.rs: system-instruction coordinate wording switched by mode - serve.rs: daemon list path description rewrite (input_schema aware) - main.rs: CUA_DRIVER_RS_COORDINATE_SPACE env seed Default coordinate_space=pixels => zero behavior change for existing pixel clients. Set CUA_DRIVER_RS_COORDINATE_SPACE=normalized_1000 to enable. Excludes rust/target build output. * feat(cua-driver): make normalized coordinate scale configurable Add CUA_DRIVER_RS_COORDINATE_SCALE (default 1000) so the normalization full-scale can absorb the Qwen 999-vs-1000 cookbook ambiguity without a recompile. norm_to_px/px_to_norm now take an explicit scale; denormalize_args reads the process-wide COORDINATE_SCALE seeded once at startup from env. * ci(cua-driver): add cross-platform release workflow for vendored driver Standalone GitHub Action that builds, signs, and releases the vendored cua-driver under packages/cua-driver. Adapted from upstream trycua/cua cd-rust-cua-driver.yml: - macOS: universal binary (lipo arm64+x86_64), codesigned + notarized into CuaDriver.app using qwen-code's existing secrets (MAC_CSC_LINK cert + App Store Connect API key notarization); Developer ID identity is auto-discovered from the imported cert. - Linux: x86_64 + arm64, built in debian:11 for a glibc 2.31 floor. - Windows: x86_64 + arm64, unsigned (no EV cert, matches upstream). - Release: softprops/action-gh-release on cua-driver-rs-v* tags or manual dispatch, prerelease. Triggered by tag push (cua-driver-rs-v*) or workflow_dispatch. * chore(cua-driver): rebrand vendored driver as qwen-cua-driver Rename the vendored trycua/cua driver so the fork installs and runs independently of any upstream trycua install: - binary cua-driver -> qwen-cua-driver - bundle CuaDriver.app -> QwenCuaDriver.app - bundle id com.trycua.driver -> com.qwencode.cua-driver Updates the cargo/uia manifests, Info.plist, bundle/proxy launch paths, permission/health-report wording, the install/build scripts, and the cross-platform release workflow. * feat(cua-driver): finish relative-coordinate mode — toggle, scale, zoom/move_cursor - CUA_DRIVER_RS_COORDINATE_SPACE is now a 1/0 toggle (via is_env_truthy); default off keeps pixel mode byte-identical to upstream. - Thread CUA_DRIVER_RS_COORDINATE_SCALE through every coordinate surface (was hardcoded 1000): input denormalization already used it; now the rewritten screenshot dims, the tool/param descriptions, and the agent instructions track the configured scale too. - Normalize zoom (window basis) and move_cursor (screen basis) inputs and rewrite their descriptions, alongside click/double_click/right_click/drag. - Fix zoom on downscaled (Retina) windows: apply the get_window_state resize ratio so the crop lands on the region the agent saw. Normalized mode only; pixel-mode zoom unchanged. All coordinate behavior stays gated on the normalized flag, so the default (pixels) path is unchanged from upstream. * chore(cua-driver): add upstream-sync script (git subtree unusable here) `git subtree split --prefix=libs/cua-driver` hangs on a commit deep in trycua/cua's history, so the subtree add/pull workflow isn't usable for the vendored driver (and a pull would re-split + re-hang every time). Add scripts/sync-from-upstream.sh instead: it git-diffs two upstream refs (never walks the full history, so it dodges the hang), reprefixes the libs/cua-driver delta to packages/cua-driver, and `git apply --reject`s it on top of our local changes — conflicts land as *.rej for manual fixup. Record the vendored version in .vendored-from and document the migration + sync method in the design doc. * chore(cua-driver): exclude vendored driver from qwen-code ESLint The vendored packages/cua-driver tree carries upstream JS (e.g. the test-harness Electron app) that doesn't follow qwen-code's lint rules and fails CI. It is not a workspace package (no package.json) and is not qwen-code TypeScript, so add it to eslint.config.js global ignores — alongside packages/desktop/** — the standard treatment for vendored code. * fix(cua-driver): let start_session revive an idle-reaped session Ports the fix from upstream trycua/cua#2035 into the vendored driver. When a session is reaped for idleness, a subsequent start_session with the same id failed instead of resuming it. Revive the ended session in place so the agent can continue rather than getting a hard error. * fix(cua-driver): retry daemon socket writes on EAGAIN Ports the fix from upstream trycua/cua#2036 into the vendored driver. A non-blocking daemon socket can return EAGAIN/EWOULDBLOCK mid-write when the peer's receive buffer is momentarily full. The driver treated that as fatal and dropped the connection. Add a bounded retry/poll loop (mirror of the read-side socket_io helper) so transient back-pressure no longer kills the session; only a real timeout or hard error fails the write. * fix(cua-driver/linux): stop reporting bare "Clicked" for X11 synthetic clicks Ports the fix from upstream trycua/cua#2025 into the vendored driver. On X11, clicks are delivered via XSendEvent synthetic events, which many toolkits (GTK/SDL/Allegro) ignore because send_event is set. The driver still reported a flat success ("Clicked"), masking that nothing happened. Report the synthetic-delivery caveat honestly so the agent can fall back instead of assuming the click landed. (platform-linux crate is not built on macOS; verified by clean upstream apply and covered by upstream + release-workflow Linux CI.) * fix(cua-driver/windows): list empty-/null-title top-level windows Ports the fix from upstream trycua/cua#2021 into the vendored driver. list_windows filtered out any top-level window whose title was empty or null, so legitimate targets (splash screens, some Electron/game windows, tool windows) were invisible to the agent and unclickable. Include empty-title windows, using class name / process as a fallback label. (platform-windows crate is not built on macOS; verified by clean upstream apply and covered by upstream + release-workflow Windows CI.) * chore(cua-driver): track cherry-picked upstream PRs; fix vendored-from The vendored copy is actually at cua-driver-rs-v0.6.7 (workspace version and all 0.6.7->0.6.8 delta files confirm it), but .vendored-from had drifted to 0.6.8 during an earlier sync-script trial whose code delta was not kept. Left as-is it would make a future sync diff 0.6.8->newer and silently skip the real 0.6.7->0.6.8 fixes. Correct it back to 0.6.7. Also record the four not-yet-merged upstream PRs we carry as cherry-picks (trycua/cua#2021/QwenLM#2025/QwenLM#2035/QwenLM#2036) in .vendored-patches.md, and have sync-from-upstream.sh point at it so the next sync reconciles them. * ci(cua-driver): satisfy repo yamllint on the release workflow The vendored-driver release workflow tripped 114 quoted-strings violations under the repo's .yamllint (quote-type: single, required). Single-quote all string scalars to match every other workflow in .github/workflows. While reformatting, the release-notes body also got its paragraph blank lines collapsed and still referenced the old CUA_DRIVER_RS_COORDINATE_SPACE= normalized_1000 value — restore the blank lines and update it to the current 0/1 toggle (default 0 = off; optional CUA_DRIVER_RS_COORDINATE_SCALE=1000). * chore(cua-driver): sync vendored driver to cua-driver-rs-v0.6.8 First real run of scripts/sync-from-upstream.sh: it 3-way-applied the upstream 0.6.7->0.6.8 delta onto our local fork. 10/12 files applied cleanly; the 2 rejects (install.ps1, _install-rust.sh) were already-applied baked-version bumps (0.6.6->0.6.7, our copies were already at 0.6.7), i.e. no real conflict. 0.6.8 brings: Wayland input path (platform-linux), linux health_report + overlay tweaks, a platform-macos build.rs step, and dependency bumps. Version moved to 0.6.8 across the workspace. Verified our work survived the sync untouched: the relative-coordinate shim (coord_norm/protocol) and all four cherry-picked PRs (socket_io/session + linux/windows) are intact — in particular the 0.6.8 edit to platform-linux tools/impl_.rs landed alongside our QwenLM#2025 change with no collision. macOS cargo check + 132 core tests green. (platform-linux/windows + the binary integration test build only on their own runners; upstream CI covers those.) * ci(cua-driver): add a dry_run gate to the release workflow Mirror the desktop-release / release dry-run pattern: a workflow_dispatch dry_run boolean input (default true). The cross-platform build + package jobs always run and upload their artifacts; the GitHub Release job now publishes only on a tag push or an explicit dry_run=false dispatch. Lets us rehearse the whole build/package pipeline (dry_run=true, notarize=false) and inspect the produced artifacts without cutting a release. A branch push (no tag, not a dispatch) likewise builds without releasing.
Problem
A caller-declared
sessionreclaimed by the idle-TTL sweep is permanently dead, andstart_sessioncannot bring it back — so a run that goes quiet for >5 min (the agent reasoning / writing code without calling an MCP tool) loses the driver entirely.Seen in agent trajectories: a run does
start_session{session:"X"}, works fine for ~80 tool calls (all carryingsession:"X"), then idles past the TTL. After that every MCP tool — includingstart_session{session:"X"}itself — returnssession ended; tool call ignored, with no way to recover.Root cause
Three things compound (
cua-driver-core/src/session.rs,cua-driver/src/serve.rs):evict_idlemarks an idle session ended by inserting it intoENDED_SESSIONS, which is never cleared.touch_sessionno-ops on an ended id, so it can't re-arm one.is_session_ended— includingstart_session, which is itself a normal tool call routed through that same arm. So the one tool meant to (re)start a session is blocked before itsinvokeever runs.Net: once
Xis reaped,is_session_ended("X")is true forever and nothing flips it back.Fix
session::revive_session— clears the ended tombstone + re-arms the idle-TTL clock. A superset oftouch_session: equivalent for a live/new id; resumes an ended one.start_sessiontool — callsrevive_sessioninstead oftouch_session, so it can resume a reaped run.start_sessionfrom the resurrection guard in both dispatch arms so it actually runs (it is the recovery path); every other tool stays guarded so a stale in-flight call can't rebuild reaped state.revivesemantics: a laterend_sessionre-fires the cleanup hooks (they key onENDED_SESSIONS), which is correct — a fresh lifecycle for the id.This is the core correctness fix and is independent of the TTL value: a short TTL just triggers reaping more often — with revival that's recoverable instead of fatal. Whether to also raise the 300 s default (
CUA_DRIVER_RS_SESSION_IDLE_TTL_SECS) is a separate tuning call, intentionally left out.Tests
New
revive_resumes_an_ended_sessionproves an ended id is usable again afterrevive; the sweep-using session tests are serialized so a cross-test zero-TTL reap can't flake them.Verification status
cargo test -p cua-driver-coregreen (101 passed, incl. the new revive test).cargo check -p cua-drivergreen.cua-driverbinary doesn't link on my machine (unrelated platform-macos Swift/Metal interop vs the local SDK), and I can't re-run the eval that surfaced this. The core logic is unit-tested and the serve.rs guard exemption is a one-line condition. A daemon-level smoke (reap a session →start_session→ confirm subsequent calls work) would be a good addition on infra that can build the binary.Summary by CodeRabbit
New Features
start_sessionagain with the same session ID.Bug Fixes