Chore(release): v6.13.5 (DO NOT MERGE)#9325

Draft: lstein wants to merge 2 commits into main from lstein/chore/v6.13.5
@lstein (Collaborator) commented Jul 1, 2026

Summary

This is the working branch for v6.13.5. Do not merge until after the final release.

Related Issues / Discussions

QA Instructions

Merge Plan

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

lstein and others added 2 commits June 29, 2026 20:48
…room before decode/encode (#9305)

* fix(qwen): estimate VAE working memory so the cache frees room before decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a
`working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache
therefore only reserved the default `device_working_mem_gb` and never
evicted the resident transformer/text encoder before the VAE decode. On a
near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder
resident) the decode then OOMs trying to allocate its working set into the
fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the
decode and encode paths so the cache makes room (evicting other models when
needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at
1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied
constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved*
(not just allocated) memory so that whenever the cache declines to free room
(free >= estimate) the decode is still guaranteed to fit. Encode uses ~half,
matching the other estimators (not independently measured).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
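The eviction behavior described in the commit above can be illustrated with a toy cache. This is a minimal sketch, not InvokeAI's model cache: `ToyModelCache`, `make_room`, and all sizes are hypothetical placeholders chosen only to show why a small default reservation leaves the resident models in place while a realistic estimate forces an eviction before the decode.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

GIB = 1024**3


@dataclass
class ToyModelCache:
    """Illustrative stand-in for a VRAM model cache (hypothetical, not InvokeAI code)."""

    capacity_bytes: int
    default_working_mem_bytes: int
    # Model name -> size in bytes, in load order (oldest first).
    resident: Dict[str, int] = field(default_factory=dict)

    def free_bytes(self) -> int:
        return self.capacity_bytes - sum(self.resident.values())

    def make_room(self, working_mem_bytes: Optional[int] = None) -> List[str]:
        """Evict resident models, oldest first, until the requested working memory fits."""
        needed = self.default_working_mem_bytes if working_mem_bytes is None else working_mem_bytes
        evicted: List[str] = []
        for name in list(self.resident):
            if self.free_bytes() >= needed:
                break
            del self.resident[name]
            evicted.append(name)
        return evicted


# Placeholder sizes: a near-full card with two large models resident (4 GiB free).
cache = ToyModelCache(capacity_bytes=48 * GIB, default_working_mem_bytes=3 * GIB)
cache.resident = {"transformer": 30 * GIB, "text_encoder": 14 * GIB}

# Without an estimate, the default reservation already fits, so nothing is evicted.
evicted_without_estimate = cache.make_room()

# With a 10 GiB estimate, the cache has to evict before the decode runs.
evicted_with_estimate = cache.make_room(10 * GIB)
```

In the first call the decode would then try to allocate its real working set into the 4 GiB remainder, which is the OOM scenario the commit describes.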

* test(qwen): cover VAE working-memory estimate is passed to cache

Address review feedback from @Pfannkuchensack on #9305:
- Add test_qwen_image_working_memory.py mirroring the z-image pattern,
  asserting both decode and encode paths call model_on_device with the
  estimated working_mem_bytes (regression guard for the OOM fix).
- Clarify the qwen estimator comment: the encode constant is not
  independently measured (half of decode, matching siblings' ratio) and
  should be recalibrated against a measured encode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
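A regression guard of the kind described in this commit could look roughly like the sketch below. It is not the contents of test_qwen_image_working_memory.py: `estimate_working_memory` and `decode` are hypothetical stand-ins, and the sketch assumes `model_on_device()` is a context manager that yields a `(weights, model)` pair.

```python
from unittest.mock import MagicMock


def estimate_working_memory(height: int, width: int) -> int:
    """Hypothetical stand-in estimator (the real one is added by the PR)."""
    return height * width * 2 * 5500


def decode(vae_info, height: int, width: int):
    """Hypothetical decode path: estimate first, then ask the cache for the model."""
    estimate = estimate_working_memory(height, width)
    # Assumption: model_on_device() yields a (weights, model) pair.
    with vae_info.model_on_device(working_mem_bytes=estimate) as (_, vae):
        return vae.decode()


def test_decode_passes_estimate_to_cache():
    vae_info = MagicMock()
    # MagicMock supports the context-manager protocol; configure what `with` yields.
    vae_info.model_on_device.return_value.__enter__.return_value = (None, MagicMock())

    decode(vae_info, 1024, 1024)

    # Fails if a refactor drops the working_mem_bytes argument again.
    vae_info.model_on_device.assert_called_once_with(
        working_mem_bytes=estimate_working_memory(1024, 1024)
    )
```

Asserting on the keyword argument, rather than on memory behavior, keeps the test runnable on machines without a GPU.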

* fix(qwen): recalibrate VAE working-memory constants from a measured grid

Add scripts/calibrate_qwen_vae_working_memory.py, a backend-portable
(CUDA/ROCm) harness that measures peak reserved-memory growth for VAE
decode/encode across a resolution grid, one fresh subprocess per point.

Calibrating on an AMD W7900 (fp16) showed the encode constant was wrong:
the previous 2750 ("half of decode") under-estimated by ~2x at every
measured resolution, the exact OOM mode Qwen Image Edit (which encodes a
real image) would hit. Raise encode 2750 -> 6300. Decode 5500 is confirmed
safe across the full 512^2..2048^2 range and left unchanged.

The grid also showed memory is super-linear in area above ~1792^2 (an
attention term) and non-monotonic (likely an SDPA-backend crossover on
ROCm); both documented in the estimator. Constants are the conservative
ROCm side and will be max-merged with a pending NVIDIA/CUDA run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
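One way to turn a measured grid into a single constant is sketched below. This is an assumption about the fitting step, not the logic of scripts/calibrate_qwen_vae_working_memory.py: it assumes the estimate has the form height * width * element_size * constant, takes the worst implied constant over the grid (which also covers the super-linear points), and applies optional headroom. All function names are hypothetical, and the grid values are placeholders, not measurements from this PR.

```python
from typing import Iterable, Tuple

# (height, width, peak reserved-memory growth in bytes)
Point = Tuple[int, int, int]


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding surprises."""
    return -(-a // b)


def implied_constant(height: int, width: int, growth_bytes: int, element_size: int = 2) -> int:
    """Smallest constant for which height * width * element_size * constant covers the growth."""
    return ceil_div(growth_bytes, height * width * element_size)


def fit_constant(points: Iterable[Point], element_size: int = 2, headroom_pct: int = 0) -> int:
    """Take the worst implied constant over the grid, then add percentage headroom."""
    worst = max(implied_constant(h, w, b, element_size) for h, w, b in points)
    return ceil_div(worst * (100 + headroom_pct), 100)


def max_merge(*constants: int) -> int:
    """Combine per-backend constants conservatively by keeping the largest."""
    return max(constants)


# Placeholder grid (NOT measured data): the larger resolution implies a larger constant.
placeholder_grid = [
    (512, 512, 512 * 512 * 2 * 1000),
    (2048, 2048, 2048 * 2048 * 2 * 1500),
]
fitted = fit_constant(placeholder_grid, headroom_pct=10)
merged = max_merge(fitted, 900)
```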

* fix(qwen): branch VAE working-memory constants by backend (ROCm vs CUDA)

Calibrating the same fp16 grid on an NVIDIA card showed CUDA reserves
~2x (decode) to ~4x (encode) less than ROCm: the Qwen VAE is attention-
heavy, and CUDA's Flash/efficient attention is O(area) and flat while the
ROCm math-attention fallback is O(area^2). The backends diverge far more
than any headroom, so a single constant either under-estimates on ROCm
(OOM) or massively over-budgets CUDA (needless eviction).

Select constants via torch.version.hip:
  decode: ROCm 5500 / CUDA 2900
  encode: ROCm 6300 / CUDA 1600
Each verified to cover its measured grid (19 points/backend) with ~8%
headroom. The CUDA run also confirms the linear model holds with Flash
attention (the ROCm super-linear/non-monotonic behavior is a math-
attention artifact), and that "encode is half of decode" is CUDA-only.

Add parametrized tests asserting the constant selected for each
(operation, backend) so a refactor can't silently swap them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
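The backend branch can be sketched as a lookup keyed on (operation, backend). The four constants below are the ones listed in this commit; everything else is an assumption for illustration: the function names are hypothetical, and the area * element_size * constant formula is inferred from the commit's "linear model" wording rather than copied from the estimator.

```python
from typing import Optional

# Constants from the commit message: (operation, backend) -> scaling constant.
VAE_WORKING_MEM_CONSTANTS = {
    ("decode", "rocm"): 5500,
    ("decode", "cuda"): 2900,
    ("encode", "rocm"): 6300,
    ("encode", "cuda"): 1600,
}


def backend_from_hip_version(hip_version: Optional[str]) -> str:
    """Map torch.version.hip (None on non-ROCm builds) to a backend name."""
    return "rocm" if hip_version is not None else "cuda"


def select_constant(operation: str, hip_version: Optional[str]) -> int:
    return VAE_WORKING_MEM_CONSTANTS[(operation, backend_from_hip_version(hip_version))]


def estimate_working_memory_bytes(
    operation: str,
    height: int,
    width: int,
    hip_version: Optional[str],
    element_size: int = 2,  # fp16
) -> int:
    """Assumed linear model: bytes = pixel area * element size * backend constant."""
    return height * width * element_size * select_constant(operation, hip_version)


# In real code the caller would pass torch.version.hip; strings are used here
# so the sketch runs without torch installed.
rocm_decode = estimate_working_memory_bytes("decode", 1248, 832, hip_version="6.2")
cuda_decode = estimate_working_memory_bytes("decode", 1248, 832, hip_version=None)
```

Keeping the constants in one table makes the parametrized per-(operation, backend) tests mentioned above straightforward to write.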

* chore(backend): ruff

* calibrate: support single-file Qwen Image VAE checkpoints

The calibration script only loaded the Qwen VAE from a diffusers
directory via from_pretrained, so passing a single .safetensors file
failed. Add _load_vae, which loads a directory as before and handles a
single-file checkpoint by loading the state dict directly: a strict load
for the diffusers layout, falling back to convert_wan_vae_to_diffusers
for the original Qwen-Image/Wan release layout (downsamples/residual/
time_conv keys) before retrying.
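
The strict-load-then-convert fallback described above follows a common pattern, sketched here with injected callables so it runs without torch or diffusers. This is not the PR's `_load_vae`: `load_with_fallback`, `strict_load`, and `convert` are hypothetical names, and the sketch assumes a strict load signals a layout mismatch by raising `RuntimeError`, as `torch.nn.Module.load_state_dict(strict=True)` does.

```python
from typing import Any, Callable, Dict

StateDict = Dict[str, Any]


def load_with_fallback(
    state_dict: StateDict,
    strict_load: Callable[[StateDict], None],
    convert: Callable[[StateDict], StateDict],
) -> str:
    """Try a strict load; on a key mismatch, convert the layout and retry once."""
    try:
        strict_load(state_dict)
        return "diffusers"
    except RuntimeError:
        # Assumption: a mismatch means the checkpoint uses the original release layout.
        strict_load(convert(state_dict))
        return "converted"


# Toy stand-ins for a model and a key converter (placeholder key names).
EXPECTED_KEYS = {"encoder.weight", "decoder.weight"}


def toy_strict_load(state_dict: StateDict) -> None:
    if set(state_dict) != EXPECTED_KEYS:
        raise RuntimeError("unexpected or missing keys")


def toy_convert(state_dict: StateDict) -> StateDict:
    return {key.replace("orig.", ""): value for key, value in state_dict.items()}


native = load_with_fallback({"encoder.weight": 1, "decoder.weight": 2}, toy_strict_load, toy_convert)
legacy = load_with_fallback({"orig.encoder.weight": 1, "orig.decoder.weight": 2}, toy_strict_load, toy_convert)
```

If the converted state dict still fails the strict load, the second `RuntimeError` propagates, so a genuinely incompatible file is not silently accepted.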

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
@lstein lstein marked this pull request as draft July 1, 2026 02:26
@github-actions (bot) added the python, invocations, backend, and python-tests labels Jul 1, 2026