feat(driver): add Windows element query geometry and verified actions #1993

Open
mustbearnold wants to merge 14 commits into trycua:main from mustbearnold:pr/windows-10x-agent-runtime

@mustbearnold mustbearnold commented Jun 23, 2026


Summary

Adds reliable Windows GUI automation surfaces that let agents address UI elements semantically and verify intended state changes instead of treating low-level OS dispatch success as task success.

Windows driver surfaces

  • Enriches get_window_state element records with stable geometry and metadata while preserving legacy fields.
  • Adds get_element_geometry for cached element-token/index geometry lookups.
  • Adds find_element for focused semantic element queries by label, role, automation id, class name, and text.
  • Adds click_verified, a verified click transaction with pre/post UIA snapshots and explicit expected-label predicates.
  • Adds set_value_verified, a verified text/value transaction that wraps set_value and verifies the requested value/label appears in post-state.
  • Adds compact verified-action state deltas (added_labels/removed_labels, added_texts/removed_texts, plus total counts) so callers get a bounded explanation of what changed without dumping the full UI tree.
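
A minimal sketch of the capped state-delta idea from the last bullet, written in Python for brevity (the driver itself is Rust). The function name summarize_delta, the cap default, and the *_count field names are illustrative assumptions; only added_labels and removed_labels are names taken from this PR.

```python
from typing import Iterable


def summarize_delta(pre: Iterable[str], post: Iterable[str], cap: int = 5) -> dict:
    """Return capped samples of added/removed labels plus uncapped totals."""
    pre_set, post_set = set(pre), set(post)
    added = sorted(post_set - pre_set)
    removed = sorted(pre_set - post_set)
    return {
        "added_labels": added[:cap],          # bounded sample
        "removed_labels": removed[:cap],      # bounded sample
        "added_label_count": len(added),      # hypothetical field name: total unique additions
        "removed_label_count": len(removed),  # hypothetical field name: total unique removals
    }


delta = summarize_delta(["Ready", "Submit"], ["Done", "Submit"])
print(delta)
```

The caller sees a bounded sample and still learns how many labels changed in total, which is the property the bullet describes.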

Safety / reliability

  • Hardens model-supplied coordinate parsing in the Python UI-TARS loop.
  • Separates os_dispatch_success from state_changed, verified, expected_change_satisfied, and final success in verified action tools.
  • Returns structured diagnostics and error results when dispatch succeeds but the expected post-state is not observed.

Verification

Source checks on branch pr/windows-10x-agent-runtime at 705585d:

cargo check -p platform-windows                                      passed
cargo test -p platform-windows --lib                                 80 passed
cargo build -p cua-driver                                            passed

Harness/unit checks:

py -m pytest tests/test_find_geometry_smoke.py tests/test_bench_cleanup_metrics.py -q  13 passed

Runtime smoke:

py scripts/cua_driver_smoke.py --runs 2 --task all --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe

Result:

6/6 passed

Covers:

  • find_element_geometry_smoke
  • click_verified_smoke
  • set_value_verified_smoke

Post-smoke cleanup:

{"Calculator": [], "cua-mcp-bench-setv": []}

Latest raw smoke artifact:

reports/smoke/find-geometry-smoke-20260623-090217.jsonl

Additional local runtime samples also passed earlier:

py scripts/cua_driver_smoke.py --runs 5 --task all --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe  15/15 passed
py scripts/cua_driver_bench.py --runs 3 --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe             12/12 passed across Calculator, Notepad, Terminal, Explorer

Diff stat

.../rust/crates/platform-windows/src/msaa.rs       |  10 +
 .../crates/platform-windows/src/tools/impl_.rs     | 675 +++++++++++++++++++--
 .../rust/crates/platform-windows/src/uia/mod.rs    |  19 +-
 .../agent/cua_agent/loops/coordinate_parser.py     |  36 ++
 libs/python/agent/cua_agent/loops/uitars.py        |  13 +-
 .../agent/tests/test_uitars_coordinate_parser.py   |  40 ++
 6 files changed, 749 insertions(+), 44 deletions(-)

Commit stack

705585d feat(driver): summarize verified action state diffs
c26e2e8 feat(driver): add verified set value transaction
7533ad8 feat(driver): add verified click transaction
5616363 feat(driver): add find element query tool
723a2da feat(driver): expose cached element geometry
5175d3f feat(driver): enrich windows structured element geometry
3d9e5ad security: harden model-supplied coordinate parsing

Notes for reviewers

  • The verified transaction tools intentionally wrap existing primitive implementations rather than replacing them, preserving existing routing/token/cache behavior.
  • success in verified tools means os_dispatch_success && verified; primitive dispatch success alone is exposed separately.
  • Added/removed state-delta samples are capped; count fields report total unique additions/removals.
  • Smoke and benchmark harnesses use deterministic, isolated Calculator, Notepad, Terminal, and Explorer temp resources and verify cleanup.
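
To illustrate the success rule in the second note, here is a small Python sketch (the real tools are Rust). The VerifiedResult class is hypothetical; the field names os_dispatch_success, state_changed, verified, and success are the ones named in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifiedResult:
    os_dispatch_success: bool  # the primitive reported that the OS accepted the input
    state_changed: bool        # the pre/post snapshots differ
    verified: bool             # the expected post-state was observed

    @property
    def success(self) -> bool:
        # Final success requires both dispatch and verification.
        return self.os_dispatch_success and self.verified


# Dispatch worked, but the expected post-state never appeared.
noop_click = VerifiedResult(os_dispatch_success=True, state_changed=False, verified=False)
print(noop_click.success)
```

Keeping os_dispatch_success as its own field lets a caller tell "the click was delivered but nothing happened" apart from "the click could not be delivered".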

Summary by CodeRabbit

  • New Features

    • Added verified interaction tools for UI clicks and value changes with pre/post-action state validation.
    • Added new tools for element discovery and geometry (bounds in screen and window coordinates).
  • Improvements

    • Enhanced Windows UI elements with class name plus enabled/visible/selected/focused metadata.
    • Enriched exported element records with richer identifiers, text fields, and geometry.
    • Replaced unsafe coordinate parsing with a validated parser that rejects non-literals, non-numerics, wrong shapes, and non-finite values.
  • Tests

    • Added unit tests for coordinate parsing and UI state flag derivation.

@vercel

vercel Bot commented Jun 23, 2026

Contributor

Someone is attempting to deploy a commit to the Cua Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0bbfce67-c127-4d29-907f-c975f6257b73

📥 Commits

Reviewing files that changed from the base of the PR and between 705585d and e78213e.

📒 Files selected for processing (4)
  • libs/cua-driver/rust/crates/platform-windows/src/msaa.rs
  • libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs
  • libs/python/agent/cua_agent/loops/coordinate_parser.py
  • libs/python/agent/tests/test_uitars_coordinate_parser.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • libs/python/agent/cua_agent/loops/coordinate_parser.py
  • libs/python/agent/tests/test_uitars_coordinate_parser.py
  • libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs

📝 Walkthrough

Two independent changes: (1) The Windows UIA/MSAA layer gains class_name, enabled, visible, selected, and focused fields on UiaNode; a shared structured_element_record helper replaces inline element serialization; four new tools (find_element, click_verified, get_element_geometry, set_value_verified) are added, registered, and tested. (2) A new parse_uitars_coordinates helper replaces eval() in the Python UITARS action loop with ast.literal_eval-based safe parsing, validated and tested.

Changes

Windows UIA Node Enrichment and New Tools

| Layer | File(s) | Summary |
|---|---|---|
| UiaNode model: class_name and state fields | libs/cua-driver/rust/crates/platform-windows/src/uia/mod.rs | UiaNode gains class_name, enabled, visible, selected, focused fields; UIA_ClassNamePropertyId is added to the bulk property prefetch loop; the cached UIA walker reads cached class name and populates all new fields when constructing both actionable and non-actionable nodes. |
| MSAA state-bit derivation and node enrichment | libs/cua-driver/rust/crates/platform-windows/src/msaa.rs | MSAA state flag constants and helper functions convert optional accState values into enabled, visible, selected, and focused booleans; state_int is extracted from acc.get_accState; both actionable and non-emitting MSAA node construction branches populate the enriched UiaNode state fields; unit tests verify state-flag derivation logic. |
| structured_element_record helper and get_window_state refactor | libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs | Introduces shared helpers to convert UiaNode into enriched JSON element records with stable IDs, label/value/text fields, backend, depth/parent info, screen and window-relative geometry, and explicit null/error fields for missing rects; get_window_state is refactored to derive target_window_bounds, call the helper for all element records, and include capture_scope, capture_mode, and screenshot metadata fields. |
| New tools: find_element, click_verified, get_element_geometry, set_value_verified | libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs | Four new tool structs are defined and implemented: find_element filters a bounded UIA walk and returns enriched element records; click_verified captures pre/post snapshots around a click and returns verification booleans; get_element_geometry reads cached bounds in screen and window coordinate spaces; set_value_verified diffs UIA text around a set_value call and reports verification results. |
| Tool registration and unit tests | libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs | build_registry registers all four new tools; unit tests cover click-verified expectation transitions, UIA walk timeout behavior, screenshot metadata coordinate-space/scale, structured_element_record stable ID/token/geometry error handling, tool registry schema advertising, label diff summarization, set_value_verified text extraction, and click button enum invariants. |
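
As a rough illustration of "screen and window-relative geometry" with explicit null/error fields, the Python sketch below converts a screen-space rectangle into window-relative bounds. Everything here is an assumption made for the example: the (left, top, right, bottom) rect layout, the function name element_geometry, and the output keys are not confirmed by the PR.

```python
from typing import Optional, Tuple

# Hypothetical rect layout: (left, top, right, bottom) in screen pixels.
Rect = Tuple[int, int, int, int]


def element_geometry(screen_rect: Optional[Rect], window_rect: Rect) -> dict:
    """Report an element's bounds in screen and window-relative coordinates."""
    if screen_rect is None:
        # Explicit nulls plus an error field instead of silently omitting geometry.
        return {"screen": None, "window_relative": None, "geometry_error": "missing_rect"}
    left, top, right, bottom = screen_rect
    win_left, win_top, _, _ = window_rect
    width, height = right - left, bottom - top
    return {
        "screen": {"x": left, "y": top, "width": width, "height": height},
        # Window-relative origin is the window's top-left corner.
        "window_relative": {"x": left - win_left, "y": top - win_top,
                            "width": width, "height": height},
        "geometry_error": None,
    }


geometry = element_geometry((110, 220, 210, 260), (100, 200, 900, 800))
print(geometry["window_relative"])
```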

Safe Coordinate Parsing for UITARS

| Layer | File(s) | Summary |
|---|---|---|
| parse_uitars_coordinates: implementation, integration, and tests | libs/python/agent/cua_agent/loops/coordinate_parser.py, libs/python/agent/cua_agent/loops/uitars.py, libs/python/agent/tests/test_uitars_coordinate_parser.py | parse_uitars_coordinates uses ast.literal_eval with boolean rejection, float coercion, math.isfinite validation, and 2-to-4 value normalization; all eval() calls for coordinate parsing in uitars.py (click, double-click, right-click, scroll, drag) are replaced with this helper; parametrized tests verify accepted 2/4-element formats and rejected syntax/non-numeric/non-finite inputs. |
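
The sketch below shows what such a parser could look like. It is not the code in coordinate_parser.py: the name parse_coordinates_sketch is made up, and "2-to-4 value normalization" is assumed here to mean that a point [x, y] is expanded to a degenerate box [x, y, x, y].

```python
import ast
import math
from typing import List


def parse_coordinates_sketch(text: str) -> List[float]:
    """Parse model-supplied coordinates without ever calling eval()."""
    try:
        value = ast.literal_eval(text)  # accepts Python literals only
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError) as exc:
        raise ValueError(f"invalid coordinate literal: {text!r}") from exc
    if not isinstance(value, (list, tuple)) or len(value) not in (2, 4):
        raise ValueError("expected a list or tuple of 2 or 4 numbers")
    coords: List[float] = []
    for item in value:
        # bool is a subclass of int, so it has to be rejected explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise ValueError(f"non-numeric coordinate: {item!r}")
        try:
            number = float(item)
        except OverflowError as exc:  # integers too large for a float
            raise ValueError(f"coordinate out of range: {item!r}") from exc
        if not math.isfinite(number):  # rejects inf and nan, e.g. from 1e309
            raise ValueError(f"non-finite coordinate: {item!r}")
        coords.append(number)
    if len(coords) == 2:
        coords = coords * 2  # assumed normalization: [x, y] -> [x, y, x, y]
    return coords


print(parse_coordinates_sketch("(10, 20)"))
```

Because ast.literal_eval only evaluates literal syntax, an input such as __import__('os') raises an error instead of executing.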

Sequence Diagram(s)

sequenceDiagram
  rect rgba(135, 206, 235, 0.5)
    Note over Caller,UiaWalker: click_verified flow
    Caller->>ClickVerifiedTool: invoke(element_token, expected_label_present)
    ClickVerifiedTool->>UiaWalker: pre-snapshot label extraction
    UiaWalker-->>ClickVerifiedTool: pre_labels set
    ClickVerifiedTool->>click: execute OS click dispatch
    click-->>ClickVerifiedTool: os_dispatch_success
    ClickVerifiedTool->>UiaWalker: post-snapshot label extraction
    UiaWalker-->>ClickVerifiedTool: post_labels set
    ClickVerifiedTool->>ClickVerifiedTool: compute label deltas, verify expected_label_present/absent
    ClickVerifiedTool-->>Caller: {os_dispatch_success, state_changed, verified, success}
  end
  rect rgba(144, 238, 144, 0.5)
    Note over Caller,UiaWalker: set_value_verified flow
    Caller->>SetValueVerifiedTool: invoke(value, expected_value, expected_label_present)
    SetValueVerifiedTool->>UiaWalker: pre-snapshot text extraction
    UiaWalker-->>SetValueVerifiedTool: pre_texts
    SetValueVerifiedTool->>set_value: delegate set_value call
    set_value-->>SetValueVerifiedTool: result
    SetValueVerifiedTool->>UiaWalker: post-snapshot text extraction
    UiaWalker-->>SetValueVerifiedTool: post_texts
    SetValueVerifiedTool->>SetValueVerifiedTool: diff texts, evaluate expected substrings
    SetValueVerifiedTool-->>Caller: {value_found, label_found, state_changed, success}
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 A rabbit once feared the sly eval() call,
So it parsed with literal_eval—no tricks at all!
Then MSAA nodes got class_name and visible too,
Four shiny new tools for the Windows UI crew.
Each click now comes verified, each value confirmed—
The warren is safer, the code firmly firmed! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 49.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main objective: adding Windows element query and geometry tools plus verified action capabilities for reliable GUI automation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@mustbearnold

Author

Posted PR and added a more complex local benchmark slice for the 10–20x computer-use goal.

PR: #1993

New complex benchmark task added locally:

  • terminal_explorer_file_workflow: Cua Driver launches cmd.exe to create a unique sentinel file in an isolated temp folder, launches File Explorer to that generated folder, verifies both the file contents and Explorer window, then closes/removes the benchmark resources.

Latest local benchmark command:

py scripts/cua_driver_bench.py --runs 2 --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe

Result: 10/10 passed across five task classes:

{
  "calculator_basic_click": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "explorer_temp_folder": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "notepad_edit_save": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_explorer_file_workflow": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_sentinel_file": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  }
}

Cleanup verified: no Calculator, benchmark Notepad, or benchmark Explorer windows remained.

Note: the Vercel check currently reports "Authorization required to deploy"; CodeRabbit was pending at last check.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@libs/cua-driver/rust/crates/platform-windows/src/msaa.rs`:
- Around line 182-187: The UiaNode construction in the MSAA implementation
hard-codes the enabled, visible, selected, and focused fields instead of
querying the actual accessibility state. Replace these hard-coded values by
calling get_accState() on the accessible object (using the same self_var
parameter pattern used for get_accRole, get_accName, and get_accDefaultAction),
define STATE_SYSTEM constants for the relevant bit flags, check the returned
VARIANT to extract the correct state values, and populate the enabled, visible,
selected, and focused fields based on the actual state flags. Apply this same
change to both locations where these hard-coded values appear, following the
existing error-handling pattern used for the other get_accXxx method calls.

In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs`:
- Around line 1102-1103: The find_element function (and similarly click_verified
and set_value_verified at lines 1223-1236 and 3703-3714) awaits the
spawn_blocking UIA walk without applying a timeout, which can cause the tools to
hang indefinitely if a UIA provider blocks. Apply the same timeout guard that is
already implemented in get_window_state to these functions by wrapping the
.await call with a timeout mechanism, so that if the UIA walk takes too long, it
returns a structured error instead of hanging the tool.
- Around line 3648-3655: The searchable_texts_from_nodes function currently
collects a broad set of text fields including metadata like automation_id,
class_name, and control_type alongside actual value/text content (name, value,
help_text). This causes expected_value verification to match against control
type names or class identifiers instead of actual content. Create two separate
lists: a narrower list containing only name, value, and help_text for
expected_value verification, and keep the broader list including automation_id,
class_name, and control_type for expected_label_present searches. Update the
callers of searchable_texts_from_nodes (including those around lines 3718-3719)
to use the appropriate narrower or broader list depending on whether they are
verifying actual values or searching for label presence.
- Around line 1240-1244: The current logic for present_ok and absent_ok only
checks the final post_labels state against expectations without verifying an
actual state change occurred from before to after the click. If a label was
already absent before the click, expected_absent will still be satisfied even if
the click had no effect. Capture the label state before the click operation
(pre_labels), then modify the present_ok check to verify the label transitioned
from absent to present (or was already present) and the absent_ok check to
verify the label transitioned from present to absent (or was already absent),
ensuring that success only returns true when the requested state change actually
occurred, not just when the final state happens to match expectations.
- Line 596: The stable_id field is incorrectly named and computed since it
includes idx and name parameters that can change across snapshots when the UI
tree reorders or labels change. Either refactor the stable_id calculation in the
format string to only include truly stable provider identifiers like backend,
pid, hwnd, and automation_id (removing idx and name), or rename this field to
something like snapshot_id or debug_id to accurately reflect that it is a
snapshot-local identifier rather than a durable stable identifier that clients
can persist across snapshots.
- Around line 876-882: The screenshot metadata being populated in the structured
JSON uses a hardcoded scale_factor of 1.0, but this value becomes incorrect when
the image is resized by resize_png_if_needed. Capture the original width and
height before calling resize_png_if_needed, then after the resize operation
completes and returns the new width (w) and height (h), calculate the actual
scale_factor by dividing the original width by the new width (orig_w / w).
Update the scale_factor field in the structured JSON with this calculated value
instead of the hardcoded 1.0 to accurately reflect the resize operation that
occurred.

In `@libs/python/agent/cua_agent/loops/coordinate_parser.py`:
- Around line 30-33: After converting the item to float in the coords append
operation, add validation to ensure the resulting float value is finite before
appending it to the coords list. Check that the float value is not infinity or
NaN (which can occur when coercing pathological numerics like 1e309), and raise
a ValueError with an appropriate message if the value is not finite. This
ensures downstream integer pixel operations do not fail due to infinite
coordinate values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da9ca25c-427a-4071-a598-77bcba67000b

📥 Commits

Reviewing files that changed from the base of the PR and between c898d7b and 705585d.

📒 Files selected for processing (6)
  • libs/cua-driver/rust/crates/platform-windows/src/msaa.rs
  • libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs
  • libs/cua-driver/rust/crates/platform-windows/src/uia/mod.rs
  • libs/python/agent/cua_agent/loops/coordinate_parser.py
  • libs/python/agent/cua_agent/loops/uitars.py
  • libs/python/agent/tests/test_uitars_coordinate_parser.py

Comment thread libs/cua-driver/rust/crates/platform-windows/src/msaa.rs Outdated
Comment thread libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs Outdated
Comment thread libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs Outdated
Comment thread libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs Outdated
Comment thread libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs Outdated
Comment thread libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs
Comment thread libs/python/agent/cua_agent/loops/coordinate_parser.py
@mustbearnold

Author

Added a browser benchmark slice locally as the next step toward harder 10–20x computer-use evals.

New task:

  • browser_local_html_open: creates an isolated local HTML file with a unique sentinel, launches Microsoft Edge with a dedicated benchmark-only user-data-dir and --new-window, verifies the browser window and sentinel text through Cua Driver accessibility queries, then closes/kills only that benchmark-owned Edge process and removes temp resources.

Latest command:

py scripts/cua_driver_bench.py --runs 2 --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe

Result: 12/12 passed across six task classes:

{
  "browser_local_html_open": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "calculator_basic_click": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "explorer_temp_folder": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "notepad_edit_save": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_explorer_file_workflow": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_sentinel_file": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  }
}

Cleanup verified: no Calculator, benchmark Notepad, benchmark Explorer, or benchmark Browser windows remained.

Local helper tests now pass 15/15.

@mustbearnold

Author

Added a harder multi-step browser interaction benchmark locally.

New task:

  • browser_button_state_change: creates an isolated local HTML page with a unique ready/done sentinel, launches Edge with a benchmark-only profile, verifies the initial ready text, finds the button through accessibility, invokes it using click_verified, verifies the post-click done text through the verified transaction's pre/post state delta, and cleans the benchmark-owned browser profile/window.

Latest command:

py scripts/cua_driver_bench.py --runs 2 --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe

Result: 14/14 passed across seven task classes:

{
  "browser_button_state_change": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "browser_local_html_open": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "calculator_basic_click": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "explorer_temp_folder": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "notepad_edit_save": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_explorer_file_workflow": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_sentinel_file": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  }
}

Cleanup verified: no Calculator, benchmark Notepad, benchmark Explorer, or benchmark Browser windows remained.

Local helper tests now pass 16/16.

@mustbearnold

Author

Ignoring the Vercel check as requested, I ran a longer full local run of the complex benchmark.

Command:

py scripts/cua_driver_bench.py --runs 5 --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe

Result: 35/35 passed across seven task classes:

{
  "browser_button_state_change": {
    "total": 5,
    "passed": 5,
    "verified": 5,
    "success": 5,
    "cleanup_success": 5
  },
  "browser_local_html_open": {
    "total": 5,
    "passed": 5,
    "verified": 5,
    "success": 5,
    "cleanup_success": 5
  },
  "calculator_basic_click": {
    "total": 5,
    "passed": 5,
    "verified": 5,
    "success": 5,
    "cleanup_success": 5
  },
  "explorer_temp_folder": {
    "total": 5,
    "passed": 5,
    "verified": 5,
    "success": 5,
    "cleanup_success": 5
  },
  "notepad_edit_save": {
    "total": 5,
    "passed": 5,
    "verified": 5,
    "success": 5,
    "cleanup_success": 5
  },
  "terminal_explorer_file_workflow": {
    "total": 5,
    "passed": 5,
    "verified": 5,
    "success": 5,
    "cleanup_success": 5
  },
  "terminal_sentinel_file": {
    "total": 5,
    "passed": 5,
    "verified": 5,
    "success": 5,
    "cleanup_success": 5
  }
}

Cleanup verified: no Calculator, benchmark Notepad, benchmark Explorer, or benchmark Browser windows remained.

Raw artifact:

reports/benchmarks/cua-driver-baseline-20260624-033221.jsonl

@mustbearnold

Author

Pushed CodeRabbit finite-coordinate hardening fix.

Commit: f2f1fb2 fix(agent): reject non-finite UITARS coordinates

Verification run before commit:

py -m pytest libs/python/agent/tests/test_uitars_coordinate_parser.py -q  # 13 passed
py -m pytest tests/test_bench_cleanup_metrics.py tests/test_find_geometry_smoke.py -q  # 16 passed
git diff --check  # passed
py scripts/cua_driver_smoke.py --runs 1 --task all --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe  # 3/3 passed

@mustbearnold

Author

Pushed CodeRabbit stable-id follow-up fix.

Commit: a25bbeb fix(driver): distinguish stable and snapshot element ids

What changed:

  • stable_id now uses durable provider identity when available: backend:pid:hwnd:automation_id:<id>.
  • Snapshot-local identity moved to snapshot_debug_id.
  • Existing element_token remains the snapshot token for within-snapshot lookup.
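
A Python sketch of the split between durable and snapshot-local identifiers. It reads the format above as "{backend}:{pid}:{hwnd}:automation_id:{id}". Returning None when no automation id exists, and the exact layout of the snapshot-local id, are assumptions for illustration; the PR does not spell out either.

```python
from typing import Optional


def stable_id(backend: str, pid: int, hwnd: int, automation_id: Optional[str]) -> Optional[str]:
    """Durable identity built only from fields that survive re-snapshots."""
    if not automation_id:
        return None  # assumption: no durable identity without an automation id
    return f"{backend}:{pid}:{hwnd}:automation_id:{automation_id}"


def snapshot_debug_id(backend: str, pid: int, hwnd: int, idx: int, name: str) -> str:
    """Snapshot-local identity; idx and name may change when the UI tree reorders."""
    return f"{backend}:{pid}:{hwnd}:{idx}:{name}"  # hypothetical layout


print(stable_id("uia", 4242, 0x1A2B, "num7Button"))
print(snapshot_debug_id("uia", 4242, 0x1A2B, 17, "Seven"))
```

The point of the split is that a client may persist the first value across snapshots but should treat the second as valid only within the snapshot that produced it.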

Verification:

cargo test -p platform-windows structured_element_record_tests --lib  # 3 passed
cargo test -p platform-windows --lib  # 80 passed
cargo check -p platform-windows  # passed
git diff --check  # passed

@mustbearnold

Author

Pushed CodeRabbit screenshot metadata fix.

Commit: 4c8635b fix(driver): report scaled screenshot metadata

What changed:

  • Added screenshot_metadata helper.
  • screenshot.scale_factor now reflects original_width / resized_width when the screenshot is resized.
  • screenshot.coordinate_space reports scaled_window_pixels when resizing occurred.
  • screenshot.original_width is included for clients that need to map image coordinates back to window coordinates.
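
A small Python sketch of the scale-factor arithmetic and of how a client might map image coordinates back to window coordinates. The function names, the width key, and the "window_pixels" value for the unresized case are assumptions; scale_factor, coordinate_space, original_width, and scaled_window_pixels come from the list above.

```python
from typing import Tuple


def screenshot_metadata_sketch(original_width: int, resized_width: int) -> dict:
    """Describe how the returned image relates to the captured window."""
    resized = resized_width != original_width
    return {
        "original_width": original_width,
        "width": resized_width,  # hypothetical key for the returned image width
        "scale_factor": original_width / resized_width,
        # "window_pixels" is a placeholder for the unresized case.
        "coordinate_space": "scaled_window_pixels" if resized else "window_pixels",
    }


def image_to_window(x: float, y: float, scale_factor: float) -> Tuple[float, float]:
    """Map a point in the (possibly downscaled) image back to window pixels."""
    return (x * scale_factor, y * scale_factor)


meta = screenshot_metadata_sketch(original_width=2560, resized_width=1280)
print(meta["scale_factor"], image_to_window(100, 50, meta["scale_factor"]))
```

With a 2560-pixel-wide window downscaled to 1280 pixels, the scale factor is 2.0, so the image point (100, 50) corresponds to window point (200.0, 100.0).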

Verification:

cargo test -p platform-windows screenshot_metadata_tests --lib  # 2 passed
cargo test -p platform-windows --lib  # 82 passed
cargo check -p platform-windows  # passed
py scripts/cua_driver_smoke.py --runs 1 --task find_geometry --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe  # 1/1 passed
git diff --check  # passed

@mustbearnold

Author

Pushed CodeRabbit UIA timeout fix.

Commit: ef0f8d9 fix(driver): bound verified UIA snapshot walks

What changed:

  • Added shared bounded UIA walk await helper with a 4s timeout.
  • Applied it to find_element, click_verified pre/post snapshots, and set_value_verified pre/post snapshots.
  • Timeout returns a structured tool error instead of allowing a provider hang to stall the tool indefinitely.
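
The pattern can be sketched in Python with asyncio as an analog of awaiting a spawn_blocking task under a timeout in Rust. The helper name bounded_walk and the shape of the error result are illustrative assumptions; only the 4s default mirrors the description above.

```python
import asyncio
import time
from typing import Any, Callable


async def bounded_walk(walk: Callable[[], Any], timeout_s: float = 4.0) -> dict:
    """Run a blocking walk on a worker thread and stop waiting after timeout_s."""
    loop = asyncio.get_running_loop()
    try:
        nodes = await asyncio.wait_for(loop.run_in_executor(None, walk), timeout_s)
    except asyncio.TimeoutError:
        # Structured error instead of an indefinite hang. The worker thread itself
        # is not killed; the caller merely stops waiting for it.
        return {"ok": False, "error": "uia_walk_timeout", "timeout_s": timeout_s}
    return {"ok": True, "nodes": nodes}


fast = asyncio.run(bounded_walk(lambda: ["button"]))
slow = asyncio.run(bounded_walk(lambda: time.sleep(0.2), timeout_s=0.05))
print(fast, slow)
```

As in the Rust case, the timeout bounds how long the tool waits, not how long a misbehaving provider keeps its worker busy.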

Verification:

cargo test -p platform-windows uia_walk_timeout_tests --lib  # 1 passed
cargo test -p platform-windows --lib  # 83 passed
cargo check -p platform-windows  # passed
py scripts/cua_driver_smoke.py --runs 1 --task all --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe  # 3/3 passed
git diff --check  # passed

@mustbearnold

Author

Pushed CodeRabbit set-value verification fix.

Commit: 3fa8e57 fix(driver): verify set value against value text

What changed:

  • Added a narrow value_texts_from_nodes path for expected_value.
  • set_value_verified.expected_value no longer succeeds by matching role/class/automation metadata such as Button, Edit, or MSAA.
  • expected_label_present continues to use the broader searchable text path.
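
A Python sketch of the narrow-versus-broad text split. The field lists follow the CodeRabbit suggestion earlier in this thread (name, value, help_text for values; automation_id, class_name, control_type added for the broad path); whether the committed Rust code uses exactly these fields is an assumption, as are the helper names and the dict-based node shape.

```python
from typing import Iterable, List, Sequence

VALUE_FIELDS = ("name", "value", "help_text")                       # actual content
METADATA_FIELDS = ("automation_id", "class_name", "control_type")   # identifiers only


def texts_from_nodes(nodes: Iterable[dict], fields: Sequence[str]) -> List[str]:
    """Collect the non-empty values of the given fields from every node."""
    return [node[f] for node in nodes for f in fields if node.get(f)]


def contains(texts: Iterable[str], expected: str) -> bool:
    """Substring match, as the walkthrough describes for expected values."""
    return any(expected in text for text in texts)


nodes = [{"name": "Message", "value": "hello", "control_type": "Edit"}]
value_texts = texts_from_nodes(nodes, VALUE_FIELDS)
searchable_texts = texts_from_nodes(nodes, VALUE_FIELDS + METADATA_FIELDS)
print(contains(value_texts, "Edit"), contains(searchable_texts, "Edit"))
```

With the narrow list, an expected_value of "Edit" no longer matches merely because the control is an Edit control.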

Verification:

cargo test -p platform-windows set_value_verified_text_tests --lib  # 1 passed
cargo test -p platform-windows --lib  # 84 passed
cargo check -p platform-windows  # passed
py scripts/cua_driver_smoke.py --runs 1 --task set_value_verified --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe  # 1/1 passed
git diff --check  # passed

@mustbearnold

Author

Pushed CodeRabbit click expectation transition fix.

Commit: 97e5fdd fix(driver): require click expectation transitions

What changed:

  • click_verified now evaluates expected present/absent labels against pre/post state, not only final state.
  • Already-satisfied expectations are surfaced as already_satisfied and do not count as expected_change_satisfied.
  • Prevents wrong-window/no-op clicks from passing just because an expected absent label was absent before the click.
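
The transition rule can be sketched as follows in Python. The function name, the final_state_matches key, and the use of exact set membership for label matching are assumptions; already_satisfied and expected_change_satisfied are the names used above.

```python
from typing import Set


def evaluate_expectation(label: str, pre_labels: Set[str], post_labels: Set[str],
                         expect_present: bool) -> dict:
    """Judge an expected-present or expected-absent label against pre/post state."""
    before = label in pre_labels
    after = label in post_labels
    final_ok = after if expect_present else not after
    transitioned = before != after
    return {
        "final_state_matches": final_ok,                         # hypothetical key
        "already_satisfied": final_ok and not transitioned,      # true before the click too
        "expected_change_satisfied": final_ok and transitioned,  # the click caused it
    }


# A no-op click: "Error" was absent before and is still absent afterwards.
noop = evaluate_expectation("Error", {"Ready"}, {"Ready"}, expect_present=False)
# A real transition: "Done" appears only after the click.
changed = evaluate_expectation("Done", {"Ready"}, {"Done"}, expect_present=True)
print(noop, changed)
```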

Verification:

cargo test -p platform-windows click_verified_expectation_tests --lib  # 2 passed
cargo test -p platform-windows --lib  # 86 passed
cargo check -p platform-windows  # passed
py scripts/cua_driver_smoke.py --runs 1 --task click_verified --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe  # 1/1 passed
git diff --check  # passed

@mustbearnold

Author

Pushed CodeRabbit MSAA state fix.

Commit: e78213e fix(driver): derive MSAA element state flags

What changed:

  • Reads get_accState(CHILDID_SELF) in the MSAA walker.
  • Derives enabled, visible, selected, and focused from STATE_SYSTEM_* flags instead of hard-coding/defaulting them.
  • Keeps rect presence as part of visibility so offscreen/invisible/no-rect MSAA nodes do not appear visible.
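
For readers unfamiliar with the MSAA state bits, here is a Python sketch of the derivation. The STATE_SYSTEM_* values are the documented oleacc.h constants; the function name and the choice to treat a missing accState as "no flags set" are assumptions, not necessarily what msaa.rs does.

```python
from typing import Optional

# Object state constants from oleacc.h.
STATE_SYSTEM_UNAVAILABLE = 0x00000001
STATE_SYSTEM_SELECTED = 0x00000002
STATE_SYSTEM_FOCUSED = 0x00000004
STATE_SYSTEM_INVISIBLE = 0x00008000
STATE_SYSTEM_OFFSCREEN = 0x00010000


def derive_state(state: Optional[int], has_rect: bool) -> dict:
    """Turn an optional accState bitmask into explicit booleans."""
    flags = state if state is not None else 0  # assumption: missing state means no flags
    hidden = bool(flags & (STATE_SYSTEM_INVISIBLE | STATE_SYSTEM_OFFSCREEN))
    return {
        "enabled": not (flags & STATE_SYSTEM_UNAVAILABLE),
        # A node without a rect is never reported as visible.
        "visible": has_rect and not hidden,
        "selected": bool(flags & STATE_SYSTEM_SELECTED),
        "focused": bool(flags & STATE_SYSTEM_FOCUSED),
    }


print(derive_state(STATE_SYSTEM_FOCUSED | STATE_SYSTEM_SELECTED, has_rect=True))
print(derive_state(STATE_SYSTEM_UNAVAILABLE | STATE_SYSTEM_OFFSCREEN, has_rect=True))
```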

Verification:

cargo test -p platform-windows state_flag_tests --lib  # 2 passed
cargo test -p platform-windows --lib  # 88 passed
cargo check -p platform-windows  # passed
py scripts/cua_driver_smoke.py --runs 1 --task find_geometry --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe  # 1/1 passed
git diff --check  # passed

@mustbearnold

Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mustbearnold

Author

Post-review-fix full local benchmark run.

Command:

py scripts/cua_driver_bench.py --runs 2 --driver-bin upstream/cua/libs/cua-driver/rust/target/debug/cua-driver.exe

Result: 14/14 passed across seven task classes after the latest CodeRabbit follow-up fixes.

{
  "browser_button_state_change": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "browser_local_html_open": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "calculator_basic_click": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "explorer_temp_folder": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "notepad_edit_save": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_explorer_file_workflow": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  },
  "terminal_sentinel_file": {
    "total": 2,
    "passed": 2,
    "verified": 2,
    "success": 2,
    "cleanup_success": 2
  }
}

Cleanup verified: no Calculator, benchmark Notepad, benchmark Explorer, or benchmark Browser windows remained.

Raw artifact:

reports/benchmarks/cua-driver-baseline-20260624-043654.jsonl
