From 1c040ae611856e633c0998e4668c0f4c68e67c65 Mon Sep 17 00:00:00 2001
From: Tim Gymnich <tim@gymni.ch>
Date: Tue, 7 Apr 2026 14:56:16 +0200
Subject: [PATCH 1/5] Add CLAUDE.md files for Claude Code guidance

Add top-level CLAUDE.md with project overview, build/test commands,
compilation flow, runtime options, and architecture. Add water/CLAUDE.md
and waveasm/CLAUDE.md covering each optional extension's build workflow,
dialect design, and pass pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Gymnich <tim@gymni.ch>
---
 .gitignore        |  3 ++
 CLAUDE.md         | 76 +++++++++++++++++++++++++++++++++++++++++++++++
 water/CLAUDE.md   | 63 +++++++++++++++++++++++++++++++++++++++
 waveasm/CLAUDE.md | 57 +++++++++++++++++++++++++++++++++++
 4 files changed, 199 insertions(+)
 create mode 100644 CLAUDE.md
 create mode 100644 water/CLAUDE.md
 create mode 100644 waveasm/CLAUDE.md

diff --git a/.gitignore b/.gitignore
index ac4f55092f..e482958e5b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -59,3 +59,6 @@ water/build_tools/wheel/water_mlir/water_mlir
 
 # rocm version detection
 requirements-pytorch-rocm-generated.txt
+
+# Claude
+CLAUDE.local.md
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000000..627c4be2cb
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,76 @@
+Wave is a Python DSL for high-performance ML kernel development targeting AMD GPUs (ROCm). The default compilation path is pure Python using IREE for codegen. Water and WaveASM are optional C++ extensions that replace parts of the IREE path — see @water/CLAUDE.md and @waveasm/CLAUDE.md.
+
+## Commands
+
+### Setup
+```bash
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements-iree-pinned.txt
+pip install -r pytorch-cpu-requirements.txt  # CPU-only dev/testing
+pip install -e ".[dev]"
+pre-commit install && pre-commit install --hook-type commit-msg
+```
+
+### Testing
+```bash
+pytest -n 4 --capture=tee-sys -vv ./tests/unittests/   # unit tests
+pytest -s tests/unittests/test_file.py::test_name -v   # single test
+lit lit_tests/ -vv                                     # MLIR LIT tests
+pytest -s tests/ --run-e2e                             # GPU tests (requires hardware)
+```
+
+### Linting
+```bash
+mypy                        # type check wave_lang
+pre-commit run --all-files  # Black, Ruff, clang-format
+```
+
+### Gotchas
+- **Always set `WAVE_CACHE_ON=0`** when testing code changes — stale cache entries hide the effect of edits: `WAVE_CACHE_ON=0 pytest ...`
+- DCO sign-off required on commits: `git commit -s`
+- Dump MLIR for debugging: `pytest --dump-mlir-files-path=/tmp/mlir tests/`
+
+## Architecture
+
+### Compilation Flow
+
+```
+Wave Python DSL
+    ↓  graph transformation passes  [wave_lang/kernel/wave/codegen/]
+Transformed FX graph
+    ↓  WaveEmitter  [compiler/wave_codegen/emitter.py]
+stream.executable MLIR
+    ↓  iree.compiler.compile_str()  [wave/utils/compile_utils.py]
+VMFB (IREE bytecode module)
+    ↓  iree.runtime.VmModule
+GPU kernel execution
+```
+
+Entry point: `wave_compile()` in `wave_lang/kernel/wave/compile.py`.
+
+### Runtimes
+
+**IREE runtime (default):** Loads VMFB into IREE's VM. Handles GPU command buffers, queue submission, benchmarking, multi-device.
+
+**Wave runtime (`options.wave_runtime=True`):** Launches HSACO kernels directly via HIP API. Supports dynamic strides and custom grid layout. Typically paired with WaveASM. Entry point: `invoke_with_wave_runtime()` in `wave_lang/kernel/wave/utils/run_utils.py`.
+
+### Key Source Locations
+
+- `wave_lang/kernel/wave/compile.py` — pipeline orchestration, backend/runtime selection
+- `wave_lang/kernel/wave/codegen/` — graph transformation passes (scheduling, barriers, index analysis)
+- `wave_lang/kernel/compiler/wave_codegen/emitter.py` — lowers FX graph to MLIR
+- `wave_lang/kernel/wave/water.py` — Water/WaveASM lowering pipeline entry points
+- `wave_lang/kernel/wave/mlir_converter/` — Wave FX ↔ Water MLIR conversion; runs in a subprocess to avoid MLIR library conflicts (Water backend only)
+
+### Optional Extensions
+
+Water and WaveASM intercept MLIR before IREE and produce HSACO directly. Enable via env vars:
+
+| Variable | Purpose |
+|---|---|
+| `WAVE_BUILD_WATER=1` | Build Water from source |
+| `WAVE_BUILD_WAVEASM=1` | Build WaveASM from source |
+| `WAVE_WATER_DIR=water/build` | Use existing Water build (fast) |
+| `WAVE_WAVEASM_DIR=waveasm/build` | Use existing WaveASM build (fast) |
+
+When both active: stream.executable MLIR → `water-opt` → `waveasm-translate` → `water-opt` → ExecutionEngine.
diff --git a/water/CLAUDE.md b/water/CLAUDE.md
new file mode 100644
index 0000000000..7387433689
--- /dev/null
+++ b/water/CLAUDE.md
@@ -0,0 +1,63 @@
+Water is an optional MLIR layer in the Wave compiler stack that replaces IREE's middle-end lowering. It defines the `wave.*` and `normalform.*` dialects, transformation passes, and Python bindings (`water_mlir` package).
+
+## Building
+
+```bash
+# First build — builds LLVM from source, takes a while
+WAVE_WATER_DIR=water/build pip install -e ".[dev]"
+
+# Iterating on C++ changes
+ninja -C water/build          # rebuild changed targets only
+pip install -e ".[dev]"       # re-links Python extension (fast, skips CMake)
+```
+
+`WAVE_WATER_DIR` tells the Wave build system to use an existing build directory instead of rebuilding from scratch. Without it, the full LLVM + Water CMake build runs on every `pip install`.
+
+LLVM is pinned at `water/llvm-sha.txt`. CLI tool: `water-opt` (analogous to `mlir-opt`).
+
+## Testing
+
+```bash
+ninja -C water/build check-water        # all lit tests
+lit test/Dialect/Wave/<test>.mlir -vv   # single test
+```
+
+Tests use lit + FileCheck. `.mlir` files use `// CHECK` comments. Negative tests are named `*-invalid.mlir`.
+
+## Architecture
+
+### Dialects
+
+**`wave.*`** — primary dialect. `wave.tensor` has symbolic shapes (unknown until inferred by passes) and an address space (`Global`, `Shared`, `Register`). Each op carries a `WaveIndexMappingAttr` encoding element distribution across device/workgroup/workitem/register dimensions as `(offset, count, step)` triples.
+
+**`normalform.*`** — `normalform.module` wraps IR and enforces declared invariants. Passes declare pre/post-conditions as normal form attributes, enabling composable pass ordering without new IR constructs.
+
+### Pass Pipeline
+
+`water-middle-end-lowering` runs these in order (`include/water/Dialect/Wave/Transforms/Passes.td`):
+
+| Pass | Purpose |
+|---|---|
+| `water-wave-detect-normal-forms` | Detect satisfied invariants |
+| `water-wave-infer-types` | Shape inference via dataflow |
+| `water-wave-infer-index-exprs` | Forward/backward index expression propagation |
+| `water-wave-propagate-elements-per-thread` | Replace register tensors with vector types |
+| `water-wave-resolve-distributed-allocations` | Map distributed shapes to concrete memref layouts |
+| `lower-wave-to-mlir` | Lower to arith/math/vector/memref dialects |
+| `lower-normalform-module` | Remove the normalform wrapper |
+
+Generic passes include SLP vectorization, bounds-checking assertions, alloc-to-alloca, and GPU module serialization (ROCDL).
+
+### Python Bindings
+
+Package `water_mlir` (prefixed to avoid IREE conflicts):
+- `water_mlir.dialects.wave` — auto-generated op bindings from `WaveOps.td`
+- `water_mlir.sympy_to_affine_converter` — converts SymPy expressions to MLIR affine expressions
+- C++ extension via nanobind (`WaterExtensionNanobind.cpp`)
+
+### Key Design Principles
+
+- **Lazy type inference**: `wave.tensor` shapes start unknown — don't assume they're set at construction.
+- **Elements-per-thread (EPT)**: tracked separately from types; required before register tensors can be lowered to vector types. A pass that changes element counts must update EPT.
+- **`water_mlir` prefix**: the Python package is prefixed to avoid conflicts with IREE's MLIR bindings. Import as `from water_mlir.dialects import wave`, not `mlir.dialects.wave`.
+- **subprocess isolation**: the Wave-side `mlir_converter` runs Water in a subprocess specifically to avoid MLIR library symbol clashes with IREE.
diff --git a/waveasm/CLAUDE.md b/waveasm/CLAUDE.md
new file mode 100644
index 0000000000..47b676655d
--- /dev/null
+++ b/waveasm/CLAUDE.md
@@ -0,0 +1,57 @@
+WaveASM is an optional C++ backend in the Wave compiler stack that replaces IREE's GPU codegen. It translates MLIR into AMDGCN assembly for AMD GPUs (gfx942/CDNA3, gfx950/CDNA3.5, gfx1250/RDNA4) and produces `.hsaco` binaries via its own `waveasm.*` MLIR dialect, linear-scan register allocator, and assembly emitter.
+
+## Building
+
+```bash
+# First build
+WAVE_BUILD_WAVEASM=1 pip install -e ".[dev]"
+
+# Iterating on C++ changes (same pattern as Water)
+ninja -C waveasm/build
+pip install -e ".[dev]"   # re-links extension, skips CMake
+```
+
+Set `WAVE_WAVEASM_DIR=waveasm/build` after first build to avoid full rebuilds on pip install. CLI tool: `waveasm-translate`.
+
+## Testing
+
+```bash
+ninja -C waveasm/build check-waveasm      # lit regression tests
+ninja -C waveasm/build check-waveasm-all  # + GPU functional tests (requires hardware)
+lit test/Transforms/<test>.mlir -vv       # single test
+```
+
+## Architecture
+
+### Compilation Pipeline
+
+```
+Input MLIR (gpu, arith, vector, memref, scf, amdgpu dialects)
+    ↓  TranslateFromMLIR  [lib/Transforms/TranslateFromMLIR.cpp]
+WaveASM IR (virtual registers, pseudo-ops)
+    ↓  ScopedCSE, Peephole, BufferLoadStrengthReduction
+    ↓  ArithLegalization
+Concrete SALU/VALU machine ops
+    ↓  Liveness → LinearScanRegAlloc → VGPRCompaction
+Physical register assignments
+    ↓  Ticketing, HazardMitigation
+    ↓  AssemblyEmitter → clang++
+.hsaco GPU binary
+```
+
+### Dialect
+
+Types (`WaveASMTypes.td`): virtual (`!waveasm.vreg/sreg/areg`) and physical (`!waveasm.pvreg/psreg/pareg`) register types, plus `!waveasm.imm` and `!waveasm.scc`. The two-phase virtual→physical split is intentional — optimization passes run on virtual SSA, allocation happens once at the end.
+
+~300 machine ops in `WaveASMOps.td`: VALU, SALU, MFMA, memory (global/LDS/SMEM), control flow, and utility ops. Pseudo-ops (`waveasm.arith.*`) exist for cases where the concrete instruction depends on register class — ArithLegalization resolves them.
+
+### Adding New Dialect Support
+
+`TranslateFromMLIR` uses a handler registry. To translate a new upstream op, add a handler to the appropriate file in `lib/Transforms/handlers/` and register it in the `TranslationContext`. The `TranslationContext` also manages the SRD (Shader Resource Descriptor) table and expression cache — use it rather than tracking state locally in handlers.
+
+### Non-Obvious Constraints
+
+- **No spilling**: `LinearScanRegAlloc` aborts if register pressure exceeds hardware limits. If you see allocation failures, the kernel uses too many live values simultaneously.
+- **Tied operands**: MFMA accumulator input and output must share the same physical registers. This is expressed via `TiedClass` equivalence classes in `Liveness` — new MFMA variants must declare their ties correctly.
+- **SCC liveness**: `!waveasm.scc` is an implicit 1-bit condition code, not a normal SSA value. The SCC verifier enforces that SCC is consumed before the next instruction that overwrites it. SCC spill/reload uses `s_cselect_b32` / `s_cmp_ne`.
+- **Ticketing**: `s_waitcnt` insertion is demand-driven via ticket tracking, not conservative. Passes that add new memory ops must ensure they participate in the ticket system.

From 616e8ca26eb5ba85eafe3c92d49dc5f853a0ed52 Mon Sep 17 00:00:00 2001
From: Tim Gymnich <tim@gymni.ch>
Date: Tue, 7 Apr 2026 15:09:47 +0200
Subject: [PATCH 2/5] Remove @ imports and mentions of water/waveasm CLAUDE.md
 from root

@ imports always load at session start. Child CLAUDE.md files load
on demand automatically when Claude works in those directories.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Gymnich <tim@gymni.ch>
---
 CLAUDE.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 627c4be2cb..e3685eed37 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,4 +1,4 @@
-Wave is a Python DSL for high-performance ML kernel development targeting AMD GPUs (ROCm). The default compilation path is pure Python using IREE for codegen. Water and WaveASM are optional C++ extensions that replace parts of the IREE path — see @water/CLAUDE.md and @waveasm/CLAUDE.md.
+Wave is a Python DSL for high-performance ML kernel development targeting AMD GPUs (ROCm). The default compilation path is pure Python using IREE for codegen. Water and WaveASM are optional C++ extensions that replace parts of the IREE path.
 
 ## Commands
 

From 14a36a3b0d70c1a4ebe08088fd21cfc2ff49bbca Mon Sep 17 00:00:00 2001
From: Tim Gymnich <tim@gymni.ch>
Date: Wed, 8 Apr 2026 11:40:04 +0200
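Subject: [PATCH 3/5] Add AGENTS.md

Move the agent guidance from the root, water/, and waveasm/ CLAUDE.md
files into AGENTS.md files in the same directories. Each CLAUDE.md is
reduced to a single `See @AGENTS.md` line. waveasm/AGENTS.md does not
carry over the "Non-Obvious Constraints" section of the former
waveasm/CLAUDE.md.
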
Signed-off-by: Tim Gymnich <tim@gymni.ch>
---
 AGENTS.md         | 76 ++++++++++++++++++++++++++++++++++++++++++++++
 CLAUDE.md         | 77 +----------------------------------------------
 water/AGENTS.md   | 63 ++++++++++++++++++++++++++++++++++++++
 water/CLAUDE.md   | 64 +--------------------------------------
 waveasm/AGENTS.md | 50 ++++++++++++++++++++++++++++++
 waveasm/CLAUDE.md | 58 +----------------------------------
 6 files changed, 192 insertions(+), 196 deletions(-)
 create mode 100644 AGENTS.md
 create mode 100644 water/AGENTS.md
 create mode 100644 waveasm/AGENTS.md

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..e3685eed37
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,76 @@
+Wave is a Python DSL for high-performance ML kernel development targeting AMD GPUs (ROCm). The default compilation path is pure Python using IREE for codegen. Water and WaveASM are optional C++ extensions that replace parts of the IREE path.
+
+## Commands
+
+### Setup
+```bash
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements-iree-pinned.txt
+pip install -r pytorch-cpu-requirements.txt  # CPU-only dev/testing
+pip install -e ".[dev]"
+pre-commit install && pre-commit install --hook-type commit-msg
+```
+
+### Testing
+```bash
+pytest -n 4 --capture=tee-sys -vv ./tests/unittests/   # unit tests
+pytest -s tests/unittests/test_file.py::test_name -v   # single test
+lit lit_tests/ -vv                                     # MLIR LIT tests
+pytest -s tests/ --run-e2e                             # GPU tests (requires hardware)
+```
+
+### Linting
+```bash
+mypy                        # type check wave_lang
+pre-commit run --all-files  # Black, Ruff, clang-format
+```
+
+### Gotchas
+- **Always set `WAVE_CACHE_ON=0`** when testing code changes — stale cache entries hide the effect of edits: `WAVE_CACHE_ON=0 pytest ...`
+- DCO sign-off required on commits: `git commit -s`
+- Dump MLIR for debugging: `pytest --dump-mlir-files-path=/tmp/mlir tests/`
+
+## Architecture
+
+### Compilation Flow
+
+```
+Wave Python DSL
+    ↓  graph transformation passes  [wave_lang/kernel/wave/codegen/]
+Transformed FX graph
+    ↓  WaveEmitter  [compiler/wave_codegen/emitter.py]
+stream.executable MLIR
+    ↓  iree.compiler.compile_str()  [wave/utils/compile_utils.py]
+VMFB (IREE bytecode module)
+    ↓  iree.runtime.VmModule
+GPU kernel execution
+```
+
+Entry point: `wave_compile()` in `wave_lang/kernel/wave/compile.py`.
+
+### Runtimes
+
+**IREE runtime (default):** Loads VMFB into IREE's VM. Handles GPU command buffers, queue submission, benchmarking, multi-device.
+
+**Wave runtime (`options.wave_runtime=True`):** Launches HSACO kernels directly via HIP API. Supports dynamic strides and custom grid layout. Typically paired with WaveASM. Entry point: `invoke_with_wave_runtime()` in `wave_lang/kernel/wave/utils/run_utils.py`.
+
+### Key Source Locations
+
+- `wave_lang/kernel/wave/compile.py` — pipeline orchestration, backend/runtime selection
+- `wave_lang/kernel/wave/codegen/` — graph transformation passes (scheduling, barriers, index analysis)
+- `wave_lang/kernel/compiler/wave_codegen/emitter.py` — lowers FX graph to MLIR
+- `wave_lang/kernel/wave/water.py` — Water/WaveASM lowering pipeline entry points
+- `wave_lang/kernel/wave/mlir_converter/` — Wave FX ↔ Water MLIR conversion; runs in a subprocess to avoid MLIR library conflicts (Water backend only)
+
+### Optional Extensions
+
+Water and WaveASM intercept MLIR before IREE and produce HSACO directly. Enable via env vars:
+
+| Variable | Purpose |
+|---|---|
+| `WAVE_BUILD_WATER=1` | Build Water from source |
+| `WAVE_BUILD_WAVEASM=1` | Build WaveASM from source |
+| `WAVE_WATER_DIR=water/build` | Use existing Water build (fast) |
+| `WAVE_WAVEASM_DIR=waveasm/build` | Use existing WaveASM build (fast) |
+
+When both active: stream.executable MLIR → `water-opt` → `waveasm-translate` → `water-opt` → ExecutionEngine.
diff --git a/CLAUDE.md b/CLAUDE.md
index e3685eed37..10ddb199c8 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,76 +1 @@
-Wave is a Python DSL for high-performance ML kernel development targeting AMD GPUs (ROCm). The default compilation path is pure Python using IREE for codegen. Water and WaveASM are optional C++ extensions that replace parts of the IREE path.
-
-## Commands
-
-### Setup
-```bash
-python -m venv .venv && source .venv/bin/activate
-pip install -r requirements-iree-pinned.txt
-pip install -r pytorch-cpu-requirements.txt  # CPU-only dev/testing
-pip install -e ".[dev]"
-pre-commit install && pre-commit install --hook-type commit-msg
-```
-
-### Testing
-```bash
-pytest -n 4 --capture=tee-sys -vv ./tests/unittests/   # unit tests
-pytest -s tests/unittests/test_file.py::test_name -v   # single test
-lit lit_tests/ -vv                                     # MLIR LIT tests
-pytest -s tests/ --run-e2e                             # GPU tests (requires hardware)
-```
-
-### Linting
-```bash
-mypy                        # type check wave_lang
-pre-commit run --all-files  # Black, Ruff, clang-format
-```
-
-### Gotchas
-- **Always set `WAVE_CACHE_ON=0`** when testing code changes — stale cache entries hide the effect of edits: `WAVE_CACHE_ON=0 pytest ...`
-- DCO sign-off required on commits: `git commit -s`
-- Dump MLIR for debugging: `pytest --dump-mlir-files-path=/tmp/mlir tests/`
-
-## Architecture
-
-### Compilation Flow
-
-```
-Wave Python DSL
-    ↓  graph transformation passes  [wave_lang/kernel/wave/codegen/]
-Transformed FX graph
-    ↓  WaveEmitter  [compiler/wave_codegen/emitter.py]
-stream.executable MLIR
-    ↓  iree.compiler.compile_str()  [wave/utils/compile_utils.py]
-VMFB (IREE bytecode module)
-    ↓  iree.runtime.VmModule
-GPU kernel execution
-```
-
-Entry point: `wave_compile()` in `wave_lang/kernel/wave/compile.py`.
-
-### Runtimes
-
-**IREE runtime (default):** Loads VMFB into IREE's VM. Handles GPU command buffers, queue submission, benchmarking, multi-device.
-
-**Wave runtime (`options.wave_runtime=True`):** Launches HSACO kernels directly via HIP API. Supports dynamic strides and custom grid layout. Typically paired with WaveASM. Entry point: `invoke_with_wave_runtime()` in `wave_lang/kernel/wave/utils/run_utils.py`.
-
-### Key Source Locations
-
-- `wave_lang/kernel/wave/compile.py` — pipeline orchestration, backend/runtime selection
-- `wave_lang/kernel/wave/codegen/` — graph transformation passes (scheduling, barriers, index analysis)
-- `wave_lang/kernel/compiler/wave_codegen/emitter.py` — lowers FX graph to MLIR
-- `wave_lang/kernel/wave/water.py` — Water/WaveASM lowering pipeline entry points
-- `wave_lang/kernel/wave/mlir_converter/` — Wave FX ↔ Water MLIR conversion; runs in a subprocess to avoid MLIR library conflicts (Water backend only)
-
-### Optional Extensions
-
-Water and WaveASM intercept MLIR before IREE and produce HSACO directly. Enable via env vars:
-
-| Variable | Purpose |
-|---|---|
-| `WAVE_BUILD_WATER=1` | Build Water from source |
-| `WAVE_BUILD_WAVEASM=1` | Build WaveASM from source |
-| `WAVE_WATER_DIR=water/build` | Use existing Water build (fast) |
-| `WAVE_WAVEASM_DIR=waveasm/build` | Use existing WaveASM build (fast) |
-
-When both active: stream.executable MLIR → `water-opt` → `waveasm-translate` → `water-opt` → ExecutionEngine.
+See @AGENTS.md
diff --git a/water/AGENTS.md b/water/AGENTS.md
new file mode 100644
index 0000000000..7387433689
--- /dev/null
+++ b/water/AGENTS.md
@@ -0,0 +1,63 @@
+Water is an optional MLIR layer in the Wave compiler stack that replaces IREE's middle-end lowering. It defines the `wave.*` and `normalform.*` dialects, transformation passes, and Python bindings (`water_mlir` package).
+
+## Building
+
+```bash
+# First build — builds LLVM from source, takes a while
+WAVE_WATER_DIR=water/build pip install -e ".[dev]"
+
+# Iterating on C++ changes
+ninja -C water/build          # rebuild changed targets only
+pip install -e ".[dev]"       # re-links Python extension (fast, skips CMake)
+```
+
+`WAVE_WATER_DIR` tells the Wave build system to use an existing build directory instead of rebuilding from scratch. Without it, the full LLVM + Water CMake build runs on every `pip install`.
+
+LLVM is pinned at `water/llvm-sha.txt`. CLI tool: `water-opt` (analogous to `mlir-opt`).
+
+## Testing
+
+```bash
+ninja -C water/build check-water        # all lit tests
+lit test/Dialect/Wave/<test>.mlir -vv   # single test
+```
+
+Tests use lit + FileCheck. `.mlir` files use `// CHECK` comments. Negative tests are named `*-invalid.mlir`.
+
+## Architecture
+
+### Dialects
+
+**`wave.*`** — primary dialect. `wave.tensor` has symbolic shapes (unknown until inferred by passes) and an address space (`Global`, `Shared`, `Register`). Each op carries a `WaveIndexMappingAttr` encoding element distribution across device/workgroup/workitem/register dimensions as `(offset, count, step)` triples.
+
+**`normalform.*`** — `normalform.module` wraps IR and enforces declared invariants. Passes declare pre/post-conditions as normal form attributes, enabling composable pass ordering without new IR constructs.
+
+### Pass Pipeline
+
+`water-middle-end-lowering` runs these in order (`include/water/Dialect/Wave/Transforms/Passes.td`):
+
+| Pass | Purpose |
+|---|---|
+| `water-wave-detect-normal-forms` | Detect satisfied invariants |
+| `water-wave-infer-types` | Shape inference via dataflow |
+| `water-wave-infer-index-exprs` | Forward/backward index expression propagation |
+| `water-wave-propagate-elements-per-thread` | Replace register tensors with vector types |
+| `water-wave-resolve-distributed-allocations` | Map distributed shapes to concrete memref layouts |
+| `lower-wave-to-mlir` | Lower to arith/math/vector/memref dialects |
+| `lower-normalform-module` | Remove the normalform wrapper |
+
+Generic passes include SLP vectorization, bounds-checking assertions, alloc-to-alloca, and GPU module serialization (ROCDL).
+
+### Python Bindings
+
+Package `water_mlir` (prefixed to avoid IREE conflicts):
+- `water_mlir.dialects.wave` — auto-generated op bindings from `WaveOps.td`
+- `water_mlir.sympy_to_affine_converter` — converts SymPy expressions to MLIR affine expressions
+- C++ extension via nanobind (`WaterExtensionNanobind.cpp`)
+
+### Key Design Principles
+
+- **Lazy type inference**: `wave.tensor` shapes start unknown — don't assume they're set at construction.
+- **Elements-per-thread (EPT)**: tracked separately from types; required before register tensors can be lowered to vector types. A pass that changes element counts must update EPT.
+- **`water_mlir` prefix**: the Python package is prefixed to avoid conflicts with IREE's MLIR bindings. Import as `from water_mlir.dialects import wave`, not `mlir.dialects.wave`.
+- **subprocess isolation**: the Wave-side `mlir_converter` runs Water in a subprocess specifically to avoid MLIR library symbol clashes with IREE.
diff --git a/water/CLAUDE.md b/water/CLAUDE.md
index 7387433689..10ddb199c8 100644
--- a/water/CLAUDE.md
+++ b/water/CLAUDE.md
@@ -1,63 +1 @@
-Water is an optional MLIR layer in the Wave compiler stack that replaces IREE's middle-end lowering. It defines the `wave.*` and `normalform.*` dialects, transformation passes, and Python bindings (`water_mlir` package).
-
-## Building
-
-```bash
-# First build — builds LLVM from source, takes a while
-WAVE_WATER_DIR=water/build pip install -e ".[dev]"
-
-# Iterating on C++ changes
-ninja -C water/build          # rebuild changed targets only
-pip install -e ".[dev]"       # re-links Python extension (fast, skips CMake)
-```
-
-`WAVE_WATER_DIR` tells the Wave build system to use an existing build directory instead of rebuilding from scratch. Without it, the full LLVM + Water CMake build runs on every `pip install`.
-
-LLVM is pinned at `water/llvm-sha.txt`. CLI tool: `water-opt` (analogous to `mlir-opt`).
-
-## Testing
-
-```bash
-ninja -C water/build check-water        # all lit tests
-lit test/Dialect/Wave/<test>.mlir -vv   # single test
-```
-
-Tests use lit + FileCheck. `.mlir` files use `// CHECK` comments. Negative tests are named `*-invalid.mlir`.
-
-## Architecture
-
-### Dialects
-
-**`wave.*`** — primary dialect. `wave.tensor` has symbolic shapes (unknown until inferred by passes) and an address space (`Global`, `Shared`, `Register`). Each op carries a `WaveIndexMappingAttr` encoding element distribution across device/workgroup/workitem/register dimensions as `(offset, count, step)` triples.
-
-**`normalform.*`** — `normalform.module` wraps IR and enforces declared invariants. Passes declare pre/post-conditions as normal form attributes, enabling composable pass ordering without new IR constructs.
-
-### Pass Pipeline
-
-`water-middle-end-lowering` runs these in order (`include/water/Dialect/Wave/Transforms/Passes.td`):
-
-| Pass | Purpose |
-|---|---|
-| `water-wave-detect-normal-forms` | Detect satisfied invariants |
-| `water-wave-infer-types` | Shape inference via dataflow |
-| `water-wave-infer-index-exprs` | Forward/backward index expression propagation |
-| `water-wave-propagate-elements-per-thread` | Replace register tensors with vector types |
-| `water-wave-resolve-distributed-allocations` | Map distributed shapes to concrete memref layouts |
-| `lower-wave-to-mlir` | Lower to arith/math/vector/memref dialects |
-| `lower-normalform-module` | Remove the normalform wrapper |
-
-Generic passes include SLP vectorization, bounds-checking assertions, alloc-to-alloca, and GPU module serialization (ROCDL).
-
-### Python Bindings
-
-Package `water_mlir` (prefixed to avoid IREE conflicts):
-- `water_mlir.dialects.wave` — auto-generated op bindings from `WaveOps.td`
-- `water_mlir.sympy_to_affine_converter` — converts SymPy expressions to MLIR affine expressions
-- C++ extension via nanobind (`WaterExtensionNanobind.cpp`)
-
-### Key Design Principles
-
-- **Lazy type inference**: `wave.tensor` shapes start unknown — don't assume they're set at construction.
-- **Elements-per-thread (EPT)**: tracked separately from types; required before register tensors can be lowered to vector types. A pass that changes element counts must update EPT.
-- **`water_mlir` prefix**: the Python package is prefixed to avoid conflicts with IREE's MLIR bindings. Import as `from water_mlir.dialects import wave`, not `mlir.dialects.wave`.
-- **subprocess isolation**: the Wave-side `mlir_converter` runs Water in a subprocess specifically to avoid MLIR library symbol clashes with IREE.
+See @AGENTS.md
diff --git a/waveasm/AGENTS.md b/waveasm/AGENTS.md
new file mode 100644
index 0000000000..43a3c852f5
--- /dev/null
+++ b/waveasm/AGENTS.md
@@ -0,0 +1,50 @@
+WaveASM is an optional C++ backend in the Wave compiler stack that replaces IREE's GPU codegen. It translates MLIR into AMDGCN assembly for AMD GPUs (gfx942/CDNA3, gfx950/CDNA3.5, gfx1250/RDNA4) and produces `.hsaco` binaries via its own `waveasm.*` MLIR dialect, linear-scan register allocator, and assembly emitter.
+
+## Building
+
+```bash
+# First build
+WAVE_BUILD_WAVEASM=1 pip install -e ".[dev]"
+
+# Iterating on C++ changes (same pattern as Water)
+ninja -C waveasm/build
+pip install -e ".[dev]"   # re-links extension, skips CMake
+```
+
+Set `WAVE_WAVEASM_DIR=waveasm/build` after first build to avoid full rebuilds on pip install. CLI tool: `waveasm-translate`.
+
+## Testing
+
+```bash
+ninja -C waveasm/build check-waveasm      # lit regression tests
+ninja -C waveasm/build check-waveasm-all  # + GPU functional tests (requires hardware)
+lit test/Transforms/<test>.mlir -vv       # single test
+```
+
+## Architecture
+
+### Compilation Pipeline
+
+```
+Input MLIR (gpu, arith, vector, memref, scf, amdgpu dialects)
+    ↓  TranslateFromMLIR  [lib/Transforms/TranslateFromMLIR.cpp]
+WaveASM IR (virtual registers, pseudo-ops)
+    ↓  ScopedCSE, Peephole, BufferLoadStrengthReduction
+    ↓  ArithLegalization
+Concrete SALU/VALU machine ops
+    ↓  Liveness → LinearScanRegAlloc → VGPRCompaction
+Physical register assignments
+    ↓  Ticketing, HazardMitigation
+    ↓  AssemblyEmitter → clang++
+.hsaco GPU binary
+```
+
+### Dialect
+
+Types (`WaveASMTypes.td`): virtual (`!waveasm.vreg/sreg/areg`) and physical (`!waveasm.pvreg/psreg/pareg`) register types, plus `!waveasm.imm` and `!waveasm.scc`. The two-phase virtual→physical split is intentional — optimization passes run on virtual SSA, allocation happens once at the end.
+
+~300 machine ops in `WaveASMOps.td`: VALU, SALU, MFMA, memory (global/LDS/SMEM), control flow, and utility ops. Pseudo-ops (`waveasm.arith.*`) exist for cases where the concrete instruction depends on register class — ArithLegalization resolves them.
+
+### Adding New Dialect Support
+
+`TranslateFromMLIR` uses a handler registry. To translate a new upstream op, add a handler to the appropriate file in `lib/Transforms/handlers/` and register it in the `TranslationContext`. The `TranslationContext` also manages the SRD (Shader Resource Descriptor) table and expression cache — use it rather than tracking state locally in handlers.
diff --git a/waveasm/CLAUDE.md b/waveasm/CLAUDE.md
index 47b676655d..10ddb199c8 100644
--- a/waveasm/CLAUDE.md
+++ b/waveasm/CLAUDE.md
@@ -1,57 +1 @@
-WaveASM is an optional C++ backend in the Wave compiler stack that replaces IREE's GPU codegen. It translates MLIR into AMDGCN assembly for AMD GPUs (gfx942/CDNA3, gfx950/CDNA3.5, gfx1250/RDNA4) and produces `.hsaco` binaries via its own `waveasm.*` MLIR dialect, linear-scan register allocator, and assembly emitter.
-
-## Building
-
-```bash
-# First build
-WAVE_BUILD_WAVEASM=1 pip install -e ".[dev]"
-
-# Iterating on C++ changes (same pattern as Water)
-ninja -C waveasm/build
-pip install -e ".[dev]"   # re-links extension, skips CMake
-```
-
-Set `WAVE_WAVEASM_DIR=waveasm/build` after first build to avoid full rebuilds on pip install. CLI tool: `waveasm-translate`.
-
-## Testing
-
-```bash
-ninja -C waveasm/build check-waveasm      # lit regression tests
-ninja -C waveasm/build check-waveasm-all  # + GPU functional tests (requires hardware)
-lit test/Transforms/<test>.mlir -vv       # single test
-```
-
-## Architecture
-
-### Compilation Pipeline
-
-```
-Input MLIR (gpu, arith, vector, memref, scf, amdgpu dialects)
-    ↓  TranslateFromMLIR  [lib/Transforms/TranslateFromMLIR.cpp]
-WaveASM IR (virtual registers, pseudo-ops)
-    ↓  ScopedCSE, Peephole, BufferLoadStrengthReduction
-    ↓  ArithLegalization
-Concrete SALU/VALU machine ops
-    ↓  Liveness → LinearScanRegAlloc → VGPRCompaction
-Physical register assignments
-    ↓  Ticketing, HazardMitigation
-    ↓  AssemblyEmitter → clang++
-.hsaco GPU binary
-```
-
-### Dialect
-
-Types (`WaveASMTypes.td`): virtual (`!waveasm.vreg/sreg/areg`) and physical (`!waveasm.pvreg/psreg/pareg`) register types, plus `!waveasm.imm` and `!waveasm.scc`. The two-phase virtual→physical split is intentional — optimization passes run on virtual SSA, allocation happens once at the end.
-
-~300 machine ops in `WaveASMOps.td`: VALU, SALU, MFMA, memory (global/LDS/SMEM), control flow, and utility ops. Pseudo-ops (`waveasm.arith.*`) exist for cases where the concrete instruction depends on register class — ArithLegalization resolves them.
-
-### Adding New Dialect Support
-
-`TranslateFromMLIR` uses a handler registry. To translate a new upstream op, add a handler to the appropriate file in `lib/Transforms/handlers/` and register it in the `TranslationContext`. The `TranslationContext` also manages the SRD (Shader Resource Descriptor) table and expression cache — use it rather than tracking state locally in handlers.
-
-### Non-Obvious Constraints
-
-- **No spilling**: `LinearScanRegAlloc` aborts if register pressure exceeds hardware limits. If you see allocation failures, the kernel uses too many live values simultaneously.
-- **Tied operands**: MFMA accumulator input and output must share the same physical registers. This is expressed via `TiedClass` equivalence classes in `Liveness` — new MFMA variants must declare their ties correctly.
-- **SCC liveness**: `!waveasm.scc` is an implicit 1-bit condition code, not a normal SSA value. The SCC verifier enforces that SCC is consumed before the next instruction that overwrites it. SCC spill/reload uses `s_cselect_b32` / `s_cmp_ne`.
-- **Ticketing**: `s_waitcnt` insertion is demand-driven via ticket tracking, not conservative. Passes that add new memory ops must ensure they participate in the ticket system.
+See @AGENTS.md

From 94d61643b9a5e23c9455672d39f1d781df7adafe Mon Sep 17 00:00:00 2001
From: Tim Gymnich <tim@gymni.ch>
Date: Wed, 8 Apr 2026 11:44:44 +0200
Subject: [PATCH 4/5] Add clang-format formatting guidance to water and waveasm
 AGENTS.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Gymnich <tim@gymni.ch>
---
 water/AGENTS.md   | 9 +++++++++
 waveasm/AGENTS.md | 9 +++++++++
 2 files changed, 18 insertions(+)

diff --git a/water/AGENTS.md b/water/AGENTS.md
index 7387433689..0c43cd491b 100644
--- a/water/AGENTS.md
+++ b/water/AGENTS.md
@@ -15,6 +15,15 @@ pip install -e ".[dev]"       # re-links Python extension (fast, skips CMake)
 
 LLVM is pinned at `water/llvm-sha.txt`. CLI tool: `water-opt` (analogous to `mlir-opt`).
 
+## Formatting
+
+C++ code is formatted with `clang-format`. Run via pre-commit or directly:
+
+```bash
+clang-format -i <file>          # format a single file in-place
+pre-commit run clang-format     # format all staged files
+```
+
 ## Testing
 
 ```bash
diff --git a/waveasm/AGENTS.md b/waveasm/AGENTS.md
index 43a3c852f5..35ec55a338 100644
--- a/waveasm/AGENTS.md
+++ b/waveasm/AGENTS.md
@@ -13,6 +13,15 @@ pip install -e ".[dev]"   # re-links extension, skips CMake
 
 Set `WAVE_WAVEASM_DIR=waveasm/build` after first build to avoid full rebuilds on pip install. CLI tool: `waveasm-translate`.
 
+## Formatting
+
+C++ code is formatted with `clang-format`. Run via pre-commit or directly:
+
+```bash
+clang-format -i <file>          # format a single file in-place
+pre-commit run clang-format     # format all staged files
+```
+
 ## Testing
 
 ```bash

From 8c9cd08268b640a1e7c7472594d1f62386fb4a66 Mon Sep 17 00:00:00 2001
From: Tim Gymnich <tim@gymni.ch>
Date: Wed, 8 Apr 2026 12:12:33 +0200
Subject: [PATCH 5/5] Update AGENTS.md files with build instructions and
 formatting guidance

- water/AGENTS.md: restructure Building section to clarify that Water
  must be built with CMake first, then pip install with WAVE_WATER_DIR;
  add full cmake configure/build commands and useful flags; note that
  ninja alone is sufficient for iterating after initial pip install;
  add git clang-format guidance; add lit location note; add Pipelines.cpp
  reference in Pass Pipeline section
- waveasm/AGENTS.md: add git clang-format guidance
- AGENTS.md: update pre-commit invocation; add AGENTS.local.md to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Gymnich <tim@gymni.ch>
---
 .gitignore        |  3 ++-
 AGENTS.md         |  5 ++---
 water/AGENTS.md   | 53 +++++++++++++++++++++++++++++++++++++----------
 waveasm/AGENTS.md |  7 ++++---
 4 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/.gitignore b/.gitignore
index e482958e5b..3c87fb3377 100644
--- a/.gitignore
+++ b/.gitignore
@@ -60,5 +60,6 @@ water/build_tools/wheel/water_mlir/water_mlir
 # rocm version detection
 requirements-pytorch-rocm-generated.txt
 
-# Claude
+# AI Agents
 CLAUDE.local.md
+AGENTS.local.md
diff --git a/AGENTS.md b/AGENTS.md
index e3685eed37..f7f1fcc34a 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -21,13 +21,12 @@ pytest -s tests/ --run-e2e                             # GPU tests (requires har
 
 ### Linting
 ```bash
-mypy                        # type check wave_lang
-pre-commit run --all-files  # Black, Ruff, clang-format
+mypy               # type check wave_lang
+pre-commit run     # run Black, Ruff, clang-format against currently staged files
 ```
 
 ### Gotchas
 - **Always set `WAVE_CACHE_ON=0`** when testing code changes — stale cache entries hide the effect of edits: `WAVE_CACHE_ON=0 pytest ...`
-- DCO sign-off required on commits: `git commit -s`
 - Dump MLIR for debugging: `pytest --dump-mlir-files-path=/tmp/mlir tests/`
 
 ## Architecture
diff --git a/water/AGENTS.md b/water/AGENTS.md
index 0c43cd491b..0c36da9975 100644
--- a/water/AGENTS.md
+++ b/water/AGENTS.md
@@ -2,26 +2,57 @@ Water is an optional MLIR layer in the Wave compiler stack that replaces IREE's
 
 ## Building
 
+Water must be built with CMake first. `pip install` alone does not build Water — `WAVE_WATER_DIR` is required to point Wave at an existing Water build.
+
+LLVM is pinned at `water/llvm-sha.txt`. CLI tool: `water-opt` (analogous to `mlir-opt`).
+
+### Step 1: Build Water with CMake
+
+Requires a pre-built LLVM/MLIR. Set `$BUILD_DIR` to your LLVM build or install tree.
+
 ```bash
-# First build — builds LLVM from source, takes a while
-WAVE_WATER_DIR=water/build pip install -e ".[dev]"
+# Configure
+cmake -G Ninja \
+      -B water/build \
+      water/ \
+      -DMLIR_DIR=$BUILD_DIR/lib/cmake/mlir \
+      -DBUILD_SHARED_LIBS=ON \
+      -DPython3_EXECUTABLE="$(which python)" \
+      -DWATER_ENABLE_PYTHON=ON
+
+# Optional: faster builds with clang + ccache + lld
+cmake -B water/build \
+      -DCMAKE_C_COMPILER=clang \
+      -DCMAKE_CXX_COMPILER=clang++ \
+      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
+      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
+      -DLLVM_USE_LINKER=lld
+
+# Build
+cmake --build water/build
+```
+
+### Step 2: Install Wave with Water bindings
 
-# Iterating on C++ changes
-ninja -C water/build          # rebuild changed targets only
-pip install -e ".[dev]"       # re-links Python extension (fast, skips CMake)
+```bash
+WAVE_WATER_DIR=water/build pip install -e ".[dev]"
 ```
 
-`WAVE_WATER_DIR` tells the Wave build system to use an existing build directory instead of rebuilding from scratch. Without it, the full LLVM + Water CMake build runs on every `pip install`.
+`WAVE_WATER_DIR` tells Wave where to find the Water build. Without it, Water is not included.
 
-LLVM is pinned at `water/llvm-sha.txt`. CLI tool: `water-opt` (analogous to `mlir-opt`).
+### Iterating on C++ changes
 
-## Formatting
+```bash
+ninja -C water/build          # rebuild changed C++ targets and Python bindings
+```
 
-C++ code is formatted with `clang-format`. Run via pre-commit or directly:
+## Formatting
 
+C++ code is formatted with `git clang-format`, which formats only the lines changed relative to a commit (default: `HEAD`):
 ```bash
-clang-format -i <file>          # format a single file in-place
-pre-commit run clang-format     # format all staged files
+git clang-format                # format uncommitted changes (working tree vs. HEAD)
+git clang-format HEAD~1         # also include most recent commit
+git clang-format main           # format everything touched on your branch
 ```
 
 ## Testing
diff --git a/waveasm/AGENTS.md b/waveasm/AGENTS.md
index 35ec55a338..e63b8172e8 100644
--- a/waveasm/AGENTS.md
+++ b/waveasm/AGENTS.md
@@ -15,11 +15,12 @@ Set `WAVE_WAVEASM_DIR=waveasm/build` after first build to avoid full rebuilds on
 
 ## Formatting
 
-C++ code is formatted with `clang-format`. Run via pre-commit or directly:
+C++ code is formatted with `git clang-format`, which formats only the lines changed relative to a commit (default: `HEAD`):
 
 ```bash
-clang-format -i <file>          # format a single file in-place
-pre-commit run clang-format     # format all staged files
+git clang-format                # format uncommitted changes (working tree vs. HEAD)
+git clang-format HEAD~1         # also include most recent commit
+git clang-format main           # format everything touched on your branch
 ```
 
 ## Testing