iree-org · tgymnich · Apr 7, 2026 · Apr 7, 2026 · Apr 8, 2026 · Apr 8, 2026
diff --git a/.gitignore b/.gitignore
@@ -59,3 +59,7 @@ water/build_tools/wheel/water_mlir/water_mlir
 
 # rocm version detection
 requirements-pytorch-rocm-generated.txt
+
+# AI Agents
+CLAUDE.local.md
+AGENTS.local.md
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,75 @@
+Wave is a Python DSL for high-performance ML kernel development targeting AMD GPUs (ROCm). The default compilation path is pure Python using IREE for codegen. Water and WaveASM are optional C++ extensions that replace parts of the IREE path.
+
+## Commands
+
+### Setup
+```bash
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements-iree-pinned.txt
+pip install -r pytorch-cpu-requirements.txt  # CPU-only dev/testing
+pip install -e ".[dev]"
+pre-commit install && pre-commit install --hook-type commit-msg
+```
+
+### Testing
+```bash
+pytest -n 4 --capture=tee-sys -vv ./tests/unittests/   # unit tests
+pytest -s tests/unittests/test_file.py::test_name -v   # single test
+lit lit_tests/ -vv                                     # MLIR LIT tests
+pytest -s tests/ --run-e2e                             # GPU tests (requires hardware)
+```
+
+### Linting
+```bash
+mypy               # type check wave_lang
+pre-commit run     # run Black, Ruff, clang-format against currently staged files
+```
+
+### Gotchas
+- **Always set `WAVE_CACHE_ON=0`** when testing code changes — stale cache entries hide the effect of edits: `WAVE_CACHE_ON=0 pytest ...`
+- Dump MLIR for debugging: `pytest --dump-mlir-files-path=/tmp/mlir tests/`
+
+## Architecture
+
+### Compilation Flow
+
+```
+Wave Python DSL
+    ↓  graph transformation passes  [wave_lang/kernel/wave/codegen/]
+Transformed FX graph
+    ↓  WaveEmitter  [compiler/wave_codegen/emitter.py]
+stream.executable MLIR
+    ↓  iree.compiler.compile_str()  [wave/utils/compile_utils.py]
+VMFB (IREE bytecode module)
+    ↓  iree.runtime.VmModule
+GPU kernel execution
+```
+
+Entry point: `wave_compile()` in `wave_lang/kernel/wave/compile.py`.
+
+### Runtimes
+
+**IREE runtime (default):** Loads VMFB into IREE's VM. Handles GPU command buffers, queue submission, benchmarking, multi-device.
+
+**Wave runtime (`options.wave_runtime=True`):** Launches HSACO kernels directly via HIP API. Supports dynamic strides and custom grid layout. Typically paired with WaveASM. Entry point: `invoke_with_wave_runtime()` in `wave_lang/kernel/wave/utils/run_utils.py`.
+
+### Key Source Locations
+
+- `wave_lang/kernel/wave/compile.py` — pipeline orchestration, backend/runtime selection
+- `wave_lang/kernel/wave/codegen/` — graph transformation passes (scheduling, barriers, index analysis)
+- `wave_lang/kernel/compiler/wave_codegen/emitter.py` — lowers FX graph to MLIR
+- `wave_lang/kernel/wave/water.py` — Water/WaveASM lowering pipeline entry points
+- `wave_lang/kernel/wave/mlir_converter/` — Wave FX ↔ Water MLIR conversion; runs in a subprocess to avoid MLIR library conflicts (Water backend only)
+
+### Optional Extensions
+
+Water and WaveASM intercept MLIR before IREE and produce HSACO directly. Enable via env vars:
+
+| Variable | Purpose |
+|---|---|
+| `WAVE_BUILD_WATER=1` | Build Water from source |
+| `WAVE_BUILD_WAVEASM=1` | Build WaveASM from source |
+| `WAVE_WATER_DIR=water/build` | Use existing Water build (fast) |
+| `WAVE_WAVEASM_DIR=waveasm/build` | Use existing WaveASM build (fast) |
+
+When both are active: stream.executable MLIR → `water-opt` → `waveasm-translate` → `water-opt` → ExecutionEngine.
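
The cache gotcha documented in the root `AGENTS.md` above can be wrapped in a small launcher so the variable is never forgotten. The following is a minimal sketch, not part of Wave: `build_pytest_invocation` is a hypothetical helper that only assembles the command line and environment named in the Gotchas section (`WAVE_CACHE_ON=0` and `--dump-mlir-files-path`).

```python
import os
import sys
from typing import Dict, List, Optional, Tuple


def build_pytest_invocation(
    test_target: str,
    mlir_dump_dir: Optional[str] = None,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for a pytest run with the Wave cache off.

    Hypothetical helper for illustration; only the WAVE_CACHE_ON variable and
    the --dump-mlir-files-path option come from AGENTS.md.
    """
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Force the cache off, overriding whatever the caller had set.
    env["WAVE_CACHE_ON"] = "0"

    cmd = [sys.executable, "-m", "pytest", "-s", test_target, "-v"]
    if mlir_dump_dir is not None:
        # Optional MLIR dump for debugging, as documented in Gotchas.
        cmd.append(f"--dump-mlir-files-path={mlir_dump_dir}")
    return cmd, env
```

The returned pair can be passed straight to `subprocess.run(cmd, env=env)`. Returning the pieces instead of running them keeps the helper easy to inspect and test.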
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1 @@
+See @AGENTS.md
diff --git a/water/AGENTS.md b/water/AGENTS.md
@@ -0,0 +1,103 @@
+Water is an optional MLIR layer in the Wave compiler stack that replaces IREE's middle-end lowering. It defines the `wave.*` and `normalform.*` dialects, transformation passes, and Python bindings (`water_mlir` package).
+
+## Building
+
+Water must be built with CMake first. `pip install` alone does not build Water — `WAVE_WATER_DIR` is required to point Wave at an existing Water build.
+
+The LLVM commit is pinned in `water/llvm-sha.txt`. CLI tool: `water-opt` (analogous to `mlir-opt`).
+
+### Step 1: Build Water with CMake
+
+Requires a pre-built LLVM/MLIR. Set `$BUILD_DIR` to your LLVM build or install tree.
+
+```bash
+# Configure
+cmake -G Ninja \
+      -B water/build \
+      water/ \
+      -DMLIR_DIR=$BUILD_DIR/lib/cmake/mlir \
+      -DBUILD_SHARED_LIBS=ON \
+      -DPython3_EXECUTABLE="$(which python)" \
+      -DWATER_ENABLE_PYTHON=ON
+
+# Optional: faster builds with clang + ccache + lld
+cmake -B water/build \
+      -DCMAKE_C_COMPILER=clang \
+      -DCMAKE_CXX_COMPILER=clang++ \
+      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
+      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
+      -DLLVM_USE_LINKER=lld
+
+# Build
+cmake --build water/build
+```
+
+### Step 2: Install Wave with Water bindings
+
+```bash
+WAVE_WATER_DIR=water/build pip install -e ".[dev]"
+```
+
+`WAVE_WATER_DIR` tells Wave where to find the Water build. Without it, Water is not included.
+
+### Iterating on C++ changes
+
+```bash
+ninja -C water/build          # rebuild changed C++ targets and Python bindings
+```
+
+## Formatting
+
+C++ code is formatted with `git clang-format`, which formats only the lines changed relative to a commit (default: `HEAD`).
+```bash
+git clang-format                # format staged changes
+git clang-format HEAD~1         # also include most recent commit
+git clang-format main           # format everything touched on your branch
+```
+
+## Testing
+
+```bash
+ninja -C water/build check-water        # all lit tests
+lit test/Dialect/Wave/<test>.mlir -vv   # single test
+```
+
+Tests use lit + FileCheck. `.mlir` files use `// CHECK` comments. Negative tests are named `*-invalid.mlir`.
+
+## Architecture
+
+### Dialects
+
+**`wave.*`** — primary dialect. `wave.tensor` has symbolic shapes (unknown until inferred by passes) and an address space (`Global`, `Shared`, `Register`). Each op carries a `WaveIndexMappingAttr` encoding element distribution across device/workgroup/workitem/register dimensions as `(offset, count, step)` triples.
+
+**`normalform.*`** — `normalform.module` wraps IR and enforces declared invariants. Passes declare pre/post-conditions as normal form attributes, enabling composable pass ordering without new IR constructs.
+
+### Pass Pipeline
+
+`water-middle-end-lowering` runs these in order (`include/water/Dialect/Wave/Transforms/Passes.td`):
+
+| Pass | Purpose |
+|---|---|
+| `water-wave-detect-normal-forms` | Detect satisfied invariants |
+| `water-wave-infer-types` | Shape inference via dataflow |
+| `water-wave-infer-index-exprs` | Forward/backward index expression propagation |
+| `water-wave-propagate-elements-per-thread` | Replace register tensors with vector types |
+| `water-wave-resolve-distributed-allocations` | Map distributed shapes to concrete memref layouts |
+| `lower-wave-to-mlir` | Lower to arith/math/vector/memref dialects |
+| `lower-normalform-module` | Remove the normalform wrapper |
+
+Generic passes include SLP vectorization, bounds-checking assertions, alloc-to-alloca, and GPU module serialization (ROCDL).
+
+### Python Bindings
+
+Package `water_mlir` (prefixed to avoid IREE conflicts):
+- `water_mlir.dialects.wave` — auto-generated op bindings from `WaveOps.td`
+- `water_mlir.sympy_to_affine_converter` — converts SymPy expressions to MLIR affine expressions
+- C++ extension via nanobind (`WaterExtensionNanobind.cpp`)
+
+### Key Design Principles
+
+- **Lazy type inference**: `wave.tensor` shapes start unknown — don't assume they're set at construction.
+- **Elements-per-thread (EPT)**: tracked separately from types; required before register tensors can be lowered to vector types. A pass that changes element counts must update EPT.
+- **`water_mlir` prefix**: the Python package is prefixed to avoid conflicts with IREE's MLIR bindings. Import as `from water_mlir.dialects import wave`, not `mlir.dialects.wave`.
+- **subprocess isolation**: the Wave-side `mlir_converter` runs Water in a subprocess specifically to avoid MLIR library symbol clashes with IREE.
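
The subprocess-isolation principle in `water/AGENTS.md` above can be illustrated generically. This is a sketch of the pattern only, not the actual `mlir_converter` code: the child program, the JSON request shape, and `run_isolated` are all hypothetical, and the child simply upper-cases its input where a real worker would import `water_mlir` and do the conversion.

```python
import json
import subprocess
import sys

# Hypothetical child program. A real worker would import the conflicting
# bindings here, so they are only ever loaded inside the child process.
CHILD_SOURCE = """
import json, sys
request = json.load(sys.stdin)
json.dump({"result": request["payload"].upper()}, sys.stdout)
"""


def run_isolated(payload: str) -> dict:
    """Send a request to a fresh Python process and return its JSON reply."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=json.dumps({"payload": payload}),  # request goes in on stdin
        capture_output=True,                     # reply comes back on stdout
        text=True,
        check=True,                              # raise if the child fails
    )
    return json.loads(proc.stdout)
```

Because the parent never imports the child's libraries, two copies of the same native symbols cannot collide in one process; the cost is serializing data across the process boundary.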
diff --git a/water/CLAUDE.md b/water/CLAUDE.md
@@ -0,0 +1 @@
+See @AGENTS.md
diff --git a/waveasm/AGENTS.md b/waveasm/AGENTS.md
@@ -0,0 +1,60 @@
+WaveASM is an optional C++ backend in the Wave compiler stack that replaces IREE's GPU codegen. It translates MLIR into AMDGCN assembly for AMD GPUs (gfx942/CDNA3, gfx950/CDNA3.5, gfx1250/RDNA4) and produces `.hsaco` binaries via its own `waveasm.*` MLIR dialect, linear-scan register allocator, and assembly emitter.
+
+## Building
+
+```bash
+# First build
+WAVE_BUILD_WAVEASM=1 pip install -e ".[dev]"
+
+# Iterating on C++ changes (same pattern as Water)
+ninja -C waveasm/build
+pip install -e ".[dev]"   # re-links extension, skips CMake
+```
+
+Set `WAVE_WAVEASM_DIR=waveasm/build` after the first build to avoid full rebuilds on `pip install`. CLI tool: `waveasm-translate`.
+
+## Formatting
+
+C++ code is formatted with `git clang-format`, which formats only the lines changed relative to a commit (default: `HEAD`).
+
+```bash
+git clang-format                # format staged changes
+git clang-format HEAD~1         # also include most recent commit
+git clang-format main           # format everything touched on your branch
+```
+
+## Testing
+
+```bash
+ninja -C waveasm/build check-waveasm      # lit regression tests
+ninja -C waveasm/build check-waveasm-all  # + GPU functional tests (requires hardware)
+lit test/Transforms/<test>.mlir -vv       # single test
+```
+
+## Architecture
+
+### Compilation Pipeline
+
+```
+Input MLIR (gpu, arith, vector, memref, scf, amdgpu dialects)
+    ↓  TranslateFromMLIR  [lib/Transforms/TranslateFromMLIR.cpp]
+WaveASM IR (virtual registers, pseudo-ops)
+    ↓  ScopedCSE, Peephole, BufferLoadStrengthReduction
+    ↓  ArithLegalization
+Concrete SALU/VALU machine ops
+    ↓  Liveness → LinearScanRegAlloc → VGPRCompaction
+Physical register assignments
+    ↓  Ticketing, HazardMitigation
+    ↓  AssemblyEmitter → clang++
+.hsaco GPU binary
+```
+
+### Dialect
+
+Types (`WaveASMTypes.td`): virtual (`!waveasm.vreg/sreg/areg`) and physical (`!waveasm.pvreg/psreg/pareg`) register types, plus `!waveasm.imm` and `!waveasm.scc`. The two-phase virtual→physical split is intentional — optimization passes run on virtual SSA, allocation happens once at the end.
+
+~300 machine ops in `WaveASMOps.td`: VALU, SALU, MFMA, memory (global/LDS/SMEM), control flow, and utility ops. Pseudo-ops (`waveasm.arith.*`) exist for cases where the concrete instruction depends on register class — ArithLegalization resolves them.
+
+### Adding New Dialect Support
+
+`TranslateFromMLIR` uses a handler registry. To translate a new upstream op, add a handler to the appropriate file in `lib/Transforms/handlers/` and register it in the `TranslationContext`. The `TranslationContext` also manages the SRD (Shader Resource Descriptor) table and expression cache — use it rather than tracking state locally in handlers.
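
The handler-registry idea described in `waveasm/AGENTS.md` above can be sketched in a few lines. The real implementation is C++ and its API is not shown in this diff, so everything here is an assumption made for illustration: the Python class, the method names (`register`, `translate`, `fresh_vreg`), the `pseudo.add` spelling, and the choice to key the expression cache on `(op name, operands)`, which is only sound for side-effect-free ops.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[["TranslationContext", Tuple[str, ...]], str]


class TranslationContext:
    """Toy stand-in for the C++ TranslationContext (all names hypothetical)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        # Shared expression cache: handlers rely on it instead of local state.
        self.expr_cache: Dict[Tuple[str, Tuple[str, ...]], str] = {}
        self.emitted: List[str] = []
        self._next_vreg = 0

    def register(self, op_name: str, handler: Handler) -> None:
        self._handlers[op_name] = handler

    def fresh_vreg(self) -> str:
        name = f"%v{self._next_vreg}"
        self._next_vreg += 1
        return name

    def translate(self, op_name: str, operands: Tuple[str, ...]) -> str:
        key = (op_name, operands)
        if key in self.expr_cache:  # reuse an already-translated expression
            return self.expr_cache[key]
        handler = self._handlers.get(op_name)
        if handler is None:
            raise NotImplementedError(f"no handler registered for {op_name}")
        result = handler(self, operands)
        self.expr_cache[key] = result
        return result


def handle_addi(ctx: TranslationContext, operands: Tuple[str, ...]) -> str:
    """Hypothetical handler: emit one pseudo-op into a new virtual register."""
    dst = ctx.fresh_vreg()
    ctx.emitted.append(f"{dst} = pseudo.add {operands[0]}, {operands[1]}")
    return dst
```

Registering `handle_addi` under `"arith.addi"` and translating the same operands twice emits a single instruction, since the second call is served from the context's cache rather than from state held by the handler.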
diff --git a/waveasm/CLAUDE.md b/waveasm/CLAUDE.md
@@ -0,0 +1 @@
+See @AGENTS.md