Skip to content

fix(cua-driver): let start_session revive an idle-reaped session (recover from "session ended")#2035

Open
LaZzyMan wants to merge 1 commit into
trycua:mainfrom
LaZzyMan:fix/session-revive-after-idle
Open

fix(cua-driver): let start_session revive an idle-reaped session (recover from "session ended")#2035
LaZzyMan wants to merge 1 commit into
trycua:mainfrom
LaZzyMan:fix/session-revive-after-idle

Conversation

@LaZzyMan

@LaZzyMan LaZzyMan commented Jun 26, 2026

Copy link
Copy Markdown

Problem

A caller-declared session reclaimed by the idle-TTL sweep is permanently dead, and start_session cannot bring it back — so a run that goes quiet for >5 min (the agent reasoning / writing code without calling an MCP tool) loses the driver entirely.

Seen in agent trajectories: a run does start_session{session:"X"}, works fine for ~80 tool calls (all carrying session:"X"), then idles past the TTL. After that every MCP tool — including start_session{session:"X"} itself — returns session ended; tool call ignored, with no way to recover.

Root cause

Three things compound (cua-driver-core/src/session.rs, cua-driver/src/serve.rs):

  1. evict_idle marks an idle session ended by inserting it into ENDED_SESSIONS, which is never cleared.
  2. touch_session no-ops on an ended id, so it can't re-arm one.
  3. The daemon's resurrection guard (serve.rs "call" arm) rejects any call whose session is_session_endedincluding start_session, which is itself a normal tool call routed through that same arm. So the one tool meant to (re)start a session is blocked before its invoke ever runs.

Net: once X is reaped, is_session_ended("X") is true forever and nothing flips it back.

Fix

  • session::revive_session — clears the ended tombstone + re-arms the idle-TTL clock. A superset of touch_session: equivalent for a live/new id; resumes an ended one.
  • start_session tool — calls revive_session instead of touch_session, so it can resume a reaped run.
  • serve.rs — exempts start_session from the resurrection guard in both dispatch arms so it actually runs (it is the recovery path); every other tool stays guarded so a stale in-flight call can't rebuild reaped state.
  • The guard message now tells the agent how to recover.

revive semantics: a later end_session re-fires the cleanup hooks (they key on ENDED_SESSIONS), which is correct — a fresh lifecycle for the id.

This is the core correctness fix and is independent of the TTL value: a short TTL just triggers reaping more often — with revival that's recoverable instead of fatal. Whether to also raise the 300 s default (CUA_DRIVER_RS_SESSION_IDLE_TTL_SECS) is a separate tuning call, intentionally left out.

Tests

New revive_resumes_an_ended_session proves an ended id is usable again after revive; the sweep-using session tests are serialized so a cross-test zero-TTL reap can't flake them.

Verification status

  • cargo test -p cua-driver-core green (101 passed, incl. the new revive test).
  • cargo check -p cua-driver green.
  • ⚠️ No end-to-end daemon test: the cua-driver binary doesn't link on my machine (unrelated platform-macos Swift/Metal interop vs the local SDK), and I can't re-run the eval that surfaced this. The core logic is unit-tested and the serve.rs guard exemption is a one-line condition. A daemon-level smoke (reap a session → start_session → confirm subsequent calls work) would be a good addition on infra that can build the binary.

Summary by CodeRabbit

  • New Features

    • You can now resume a previously ended or idle-reclaimed session by calling start_session again with the same session ID.
    • Session recovery now works even after a session has been marked ended, instead of being permanently unavailable.
  • Bug Fixes

    • Improved handling of ended sessions so non-resume tool calls return clearer guidance on how to continue.
    • Fixed session lifecycle behavior so revived sessions can be used normally again and ended afterward.

A caller-declared session reclaimed by the idle-TTL sweep (the agent went
quiet >5min while reasoning / writing code) was permanently dead: ENDED_SESSIONS
is never cleared, touch_session no-ops on an ended id, and the daemon's
resurrection guard rejected every subsequent call — including start_session
itself, which routes through the same "call" arm. Once a run idled out, every
MCP tool returned "session ended; tool call ignored" with no way back.

- session::revive_session: clear the ended tombstone + re-arm the idle-TTL
  clock so an explicit start_session resumes the run (a superset of
  touch_session: no-op-equivalent for a live/new id).
- start_session tool: call revive_session instead of touch_session.
- serve.rs: exempt start_session from the resurrection guard in both dispatch
  arms so it actually runs (it is the recovery path); other tools stay guarded.
- the guard message now tells the agent how to recover.

revive is unit-tested (revive_resumes_an_ended_session); the sweep-using
session tests are serialized so a cross-test zero-TTL reap can't re-end a
freshly revived id. cargo test -p cua-driver-core green; cargo check
-p cua-driver green.
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@vercel

vercel Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@LaZzyMan is attempting to deploy a commit to the Cua Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds revive_session, updates start_session to call it, and changes daemon call dispatch so ended sessions can resume via start_session on Unix and Windows. Session tests now serialize TTL sweeps and cover revive/end lifecycle behavior.

Changes

Session revival flow

Layer / File(s) Summary
Session revival state
libs/cua-driver/rust/crates/cua-driver-core/src/session.rs
Adds revive_session, serializes TTL-sweep tests, and extends session lifecycle coverage for ending, reviving, sweeping, and ending again.
start_session revival
libs/cua-driver/rust/crates/cua-driver-core/src/session_tools.rs
Updates the start_session description and switches invocation from touch_session to revive_session.
Ended-session dispatch
libs/cua-driver/rust/crates/cua-driver/src/serve.rs
Allows start_session through the ended-session guard on Unix and Windows and changes the blocked-call response text for other tools.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant call_dispatch as "serve.rs call dispatch"
  participant start_tool as "StartSessionTool::invoke"
  participant revive_session as "session::revive_session"
  participant ENDED_SESSIONS
  participant SESSION_ACTIVITY

  Client->>call_dispatch: call(tool_name, effective_session=session_id)
  alt tool_name == "start_session"
    call_dispatch->>start_tool: invoke(session_id)
    start_tool->>revive_session: revive_session(session_id)
    revive_session->>ENDED_SESSIONS: remove tombstone
    revive_session->>SESSION_ACTIVITY: store fresh timestamp
    start_tool-->>call_dispatch: session resumes
    call_dispatch-->>Client: ok
  else tool_name != "start_session" and session ended
    call_dispatch-->>Client: ok + call start_session with same session
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nibbled the tombstone, hop and cheer,
revived the session, crystal clear.
The idle clock gave one more beat,
and start_session hopped to its seat.
With a tiny thump, the run’s alive!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: allowing start_session to revive an idle-reaped session.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@libs/cua-driver/rust/crates/cua-driver-core/src/session.rs`:
- Around line 138-152: revive_session currently clears the session-layer
tombstone but leaves the overlay-layer tombstone in place, so revived sessions
still get blocked by the cursor renderer. Update revive_session in session.rs to
also clear the overlay state for the same session_id, ideally by introducing or
invoking a dedicated revive hook that removes the key from RenderState::ended in
platform-macos/src/overlay.rs. Make sure the fix aligns with the existing
end-session cleanup path so revived sessions can start rendering cursors again
without changing the trackable-session guard behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a03f76d9-2fd4-452c-a837-f2286fe6501b

📥 Commits

Reviewing files that changed from the base of the PR and between e08574c and ad8cbb4.

📒 Files selected for processing (3)
  • libs/cua-driver/rust/crates/cua-driver-core/src/session.rs
  • libs/cua-driver/rust/crates/cua-driver-core/src/session_tools.rs
  • libs/cua-driver/rust/crates/cua-driver/src/serve.rs

Comment on lines +138 to +152
pub fn revive_session(session_id: &str) {
if !is_trackable(session_id) {
return;
}
// Clear the permanent tombstone so `is_session_ended` reads false again and
// the daemon's resurrection guard stops rejecting calls for this id. A
// later `end_session` re-fires the cleanup hooks (fire_session_end is keyed
// on this set), which is correct — it's a fresh lifecycle for the id.
ended_sessions().lock().unwrap().remove(session_id);
// Re-arm the idle-TTL clock so the revived session is tracked again.
activity()
.lock()
.unwrap()
.insert(session_id.to_owned(), Instant::now());
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 1. Find revive_session and any hooks it fires.
rg -nP -C3 '\bfn revive_session\b' libs/cua-driver/rust/crates

# 2. Show how the overlay `ended` tombstone is mutated — is it ever removed from?
fd -t f overlay.rs libs/cua-driver/rust/crates/platform-macos \
  --exec rg -nP -C3 '\bended\b'

# 3. Look for any revive/resume hook registration that could clear overlay state.
rg -nP -C2 'revive|resume_session|register_session_(begin|revive)_hook' \
  libs/cua-driver/rust/crates

Repository: trycua/cua

Length of output: 10236


revive_session clears the session tombstone but not the overlay-layer cursor tombstone — revived sessions will fail to render cursors.

The cursor overlay (platform-macos/src/overlay.rs) maintains a separate ended: HashSet<CursorKey> that permanently blocks command processing for tombstoned keys. As seen in lines 143–145 and 255–257 of overlay.rs, any Cmd or seed_start for a key in this set is silently dropped to prevent "ghost-cursor resurrection."

revive_session in session.rs (lines 138–152) correctly clears the session-layer tombstone (ended_sessions) but fails to clear the overlay-layer tombstone. Consequently:

  1. The daemon accepts tool calls for the revived session.
  2. The overlay guard continues to drop all cursor commands for that session_id.
  3. The cursor never reappears, violating the "RESUMES the session" contract in session_tools.rs (line 79).

Add a call to clear the overlay tombstone (e.g., via a fire_session_revive hook that removes the session_id from RenderState::ended) to ensure the render layer respects the session lifecycle reset.

Relevant code sections

session.rs:

pub fn revive_session(session_id: &str) {
    if !is_trackable(session_id) {
        return;
    }
    // ...
    ended_sessions().lock().unwrap().remove(session_id); // Clears session tombstone
    // MISSING: Clear overlay tombstone here
    activity()
        .lock()
        .unwrap()
        .insert(session_id.to_owned(), Instant::now());
}

overlay.rs (guard logic):

// Line 143-145
if map.ended.contains(&key) {
    return None; // Drops command silently
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/cua-driver-core/src/session.rs` around lines 138
- 152, revive_session currently clears the session-layer tombstone but leaves
the overlay-layer tombstone in place, so revived sessions still get blocked by
the cursor renderer. Update revive_session in session.rs to also clear the
overlay state for the same session_id, ideally by introducing or invoking a
dedicated revive hook that removes the key from RenderState::ended in
platform-macos/src/overlay.rs. Make sure the fix aligns with the existing
end-session cleanup path so revived sessions can start rendering cursors again
without changing the trackable-session guard behavior.

LaZzyMan added a commit to QwenLM/qwen-code that referenced this pull request Jun 26, 2026
Ports the fix from upstream trycua/cua#2035 into the vendored driver.

When a session is reaped for idleness, a subsequent start_session with the
same id failed instead of resuming it. Revive the ended session in place so
the agent can continue rather than getting a hard error.
shenyankm pushed a commit to shenyankm/qwen-code that referenced this pull request Jun 26, 2026
…coordinates (QwenLM#5896)

* feat(cua-driver): vendor trycua/cua driver with 1000-normalized coordinate support

Vendor libs/cua-driver from trycua/cua into packages/cua-driver as the
basis for qwen-code's computer-use backend, adding an opt-in relative
(1000x1000 normalized) coordinate mode for Qwen-VL clients.

- coord_norm.rs: 0-1000 <-> pixel conversion, per-(pid,window_id) size
  cache, tools/list description rewrite (TDD, 27 tests)
- ToolRegistry: normalized field + invoke input/output hooks
- protocol.rs: system-instruction coordinate wording switched by mode
- serve.rs: daemon list path description rewrite (input_schema aware)
- main.rs: CUA_DRIVER_RS_COORDINATE_SPACE env seed

Default coordinate_space=pixels => zero behavior change for existing
pixel clients. Set CUA_DRIVER_RS_COORDINATE_SPACE=normalized_1000 to
enable. Excludes rust/target build output.

* feat(cua-driver): make normalized coordinate scale configurable

Add CUA_DRIVER_RS_COORDINATE_SCALE (default 1000) so the normalization
full-scale can absorb the Qwen 999-vs-1000 cookbook ambiguity without a
recompile. norm_to_px/px_to_norm now take an explicit scale; denormalize_args
reads the process-wide COORDINATE_SCALE seeded once at startup from env.

* ci(cua-driver): add cross-platform release workflow for vendored driver

Standalone GitHub Action that builds, signs, and releases the vendored
cua-driver under packages/cua-driver. Adapted from upstream trycua/cua
cd-rust-cua-driver.yml:

- macOS: universal binary (lipo arm64+x86_64), codesigned + notarized into
  CuaDriver.app using qwen-code's existing secrets (MAC_CSC_LINK cert +
  App Store Connect API key notarization); Developer ID identity is
  auto-discovered from the imported cert.
- Linux: x86_64 + arm64, built in debian:11 for a glibc 2.31 floor.
- Windows: x86_64 + arm64, unsigned (no EV cert, matches upstream).
- Release: softprops/action-gh-release on cua-driver-rs-v* tags or manual
  dispatch, prerelease.

Triggered by tag push (cua-driver-rs-v*) or workflow_dispatch.

* chore(cua-driver): rebrand vendored driver as qwen-cua-driver

Rename the vendored trycua/cua driver so the fork installs and runs
independently of any upstream trycua install:
- binary cua-driver -> qwen-cua-driver
- bundle CuaDriver.app -> QwenCuaDriver.app
- bundle id com.trycua.driver -> com.qwencode.cua-driver

Updates the cargo/uia manifests, Info.plist, bundle/proxy launch paths,
permission/health-report wording, the install/build scripts, and the
cross-platform release workflow.

* feat(cua-driver): finish relative-coordinate mode — toggle, scale, zoom/move_cursor

- CUA_DRIVER_RS_COORDINATE_SPACE is now a 1/0 toggle (via is_env_truthy);
  default off keeps pixel mode byte-identical to upstream.
- Thread CUA_DRIVER_RS_COORDINATE_SCALE through every coordinate surface
  (was hardcoded 1000): input denormalization already used it; now the
  rewritten screenshot dims, the tool/param descriptions, and the agent
  instructions track the configured scale too.
- Normalize zoom (window basis) and move_cursor (screen basis) inputs and
  rewrite their descriptions, alongside click/double_click/right_click/drag.
- Fix zoom on downscaled (Retina) windows: apply the get_window_state resize
  ratio so the crop lands on the region the agent saw. Normalized mode only;
  pixel-mode zoom unchanged.

All coordinate behavior stays gated on the normalized flag, so the default
(pixels) path is unchanged from upstream.

* chore(cua-driver): add upstream-sync script (git subtree unusable here)

`git subtree split --prefix=libs/cua-driver` hangs on a commit deep in
trycua/cua's history, so the subtree add/pull workflow isn't usable for
the vendored driver (and a pull would re-split + re-hang every time).

Add scripts/sync-from-upstream.sh instead: it git-diffs two upstream refs
(never walks the full history, so it dodges the hang), reprefixes the
libs/cua-driver delta to packages/cua-driver, and `git apply --reject`s it
on top of our local changes — conflicts land as *.rej for manual fixup.
Record the vendored version in .vendored-from and document the migration +
sync method in the design doc.

* chore(cua-driver): exclude vendored driver from qwen-code ESLint

The vendored packages/cua-driver tree carries upstream JS (e.g. the
test-harness Electron app) that doesn't follow qwen-code's lint rules and
fails CI. It is not a workspace package (no package.json) and is not
qwen-code TypeScript, so add it to eslint.config.js global ignores —
alongside packages/desktop/** — the standard treatment for vendored code.

* fix(cua-driver): let start_session revive an idle-reaped session

Ports the fix from upstream trycua/cua#2035 into the vendored driver.

When a session is reaped for idleness, a subsequent start_session with the
same id failed instead of resuming it. Revive the ended session in place so
the agent can continue rather than getting a hard error.

* fix(cua-driver): retry daemon socket writes on EAGAIN

Ports the fix from upstream trycua/cua#2036 into the vendored driver.

A non-blocking daemon socket can return EAGAIN/EWOULDBLOCK mid-write when the
peer's receive buffer is momentarily full. The driver treated that as fatal
and dropped the connection. Add a bounded retry/poll loop (mirror of the
read-side socket_io helper) so transient back-pressure no longer kills the
session; only a real timeout or hard error fails the write.

* fix(cua-driver/linux): stop reporting bare "Clicked" for X11 synthetic clicks

Ports the fix from upstream trycua/cua#2025 into the vendored driver.

On X11, clicks are delivered via XSendEvent synthetic events, which many
toolkits (GTK/SDL/Allegro) ignore because send_event is set. The driver still
reported a flat success ("Clicked"), masking that nothing happened. Report
the synthetic-delivery caveat honestly so the agent can fall back instead of
assuming the click landed.

(platform-linux crate is not built on macOS; verified by clean upstream apply
and covered by upstream + release-workflow Linux CI.)

* fix(cua-driver/windows): list empty-/null-title top-level windows

Ports the fix from upstream trycua/cua#2021 into the vendored driver.

list_windows filtered out any top-level window whose title was empty or null,
so legitimate targets (splash screens, some Electron/game windows, tool
windows) were invisible to the agent and unclickable. Include empty-title
windows, using class name / process as a fallback label.

(platform-windows crate is not built on macOS; verified by clean upstream
apply and covered by upstream + release-workflow Windows CI.)

* chore(cua-driver): track cherry-picked upstream PRs; fix vendored-from

The vendored copy is actually at cua-driver-rs-v0.6.7 (workspace version and
all 0.6.7->0.6.8 delta files confirm it), but .vendored-from had drifted to
0.6.8 during an earlier sync-script trial whose code delta was not kept. Left
as-is it would make a future sync diff 0.6.8->newer and silently skip the real
0.6.7->0.6.8 fixes. Correct it back to 0.6.7.

Also record the four not-yet-merged upstream PRs we carry as cherry-picks
(trycua/cua#2021/QwenLM#2025/QwenLM#2035/QwenLM#2036) in .vendored-patches.md, and have
sync-from-upstream.sh point at it so the next sync reconciles them.

* ci(cua-driver): satisfy repo yamllint on the release workflow

The vendored-driver release workflow tripped 114 quoted-strings violations
under the repo's .yamllint (quote-type: single, required). Single-quote all
string scalars to match every other workflow in .github/workflows.

While reformatting, the release-notes body also got its paragraph blank lines
collapsed and still referenced the old CUA_DRIVER_RS_COORDINATE_SPACE=
normalized_1000 value — restore the blank lines and update it to the current
0/1 toggle (default 0 = off; optional CUA_DRIVER_RS_COORDINATE_SCALE=1000).

* chore(cua-driver): sync vendored driver to cua-driver-rs-v0.6.8

First real run of scripts/sync-from-upstream.sh: it 3-way-applied the upstream
0.6.7->0.6.8 delta onto our local fork. 10/12 files applied cleanly; the 2
rejects (install.ps1, _install-rust.sh) were already-applied baked-version
bumps (0.6.6->0.6.7, our copies were already at 0.6.7), i.e. no real conflict.

0.6.8 brings: Wayland input path (platform-linux), linux health_report +
overlay tweaks, a platform-macos build.rs step, and dependency bumps. Version
moved to 0.6.8 across the workspace.

Verified our work survived the sync untouched: the relative-coordinate shim
(coord_norm/protocol) and all four cherry-picked PRs (socket_io/session +
linux/windows) are intact — in particular the 0.6.8 edit to platform-linux
tools/impl_.rs landed alongside our QwenLM#2025 change with no collision. macOS
cargo check + 132 core tests green. (platform-linux/windows + the binary
integration test build only on their own runners; upstream CI covers those.)

* ci(cua-driver): add a dry_run gate to the release workflow

Mirror the desktop-release / release dry-run pattern: a workflow_dispatch
dry_run boolean input (default true). The cross-platform build + package jobs
always run and upload their artifacts; the GitHub Release job now publishes
only on a tag push or an explicit dry_run=false dispatch.

Lets us rehearse the whole build/package pipeline (dry_run=true, notarize=false)
and inspect the produced artifacts without cutting a release. A branch push
(no tag, not a dispatch) likewise builds without releasing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant