Chore(release): v6.13.5 (DO NOT MERGE) #9325
Draft
lstein wants to merge 2 commits into
…room before decode/encode (#9305)

* fix(qwen): estimate VAE working memory so the cache frees room before decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a `working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache therefore only reserved the default `device_working_mem_gb` and never evicted the resident transformer/text encoder before the VAE decode. On a near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder resident) the decode then OOMs trying to allocate its working set into the fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the decode and encode paths so the cache makes room (evicting other models when needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at 1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved* (not just allocated) memory so that whenever the cache declines to free room (free >= estimate) the decode is still guaranteed to fit. Encode uses ~half, matching the other estimators (not independently measured).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(qwen): cover VAE working-memory estimate is passed to cache

Address review feedback from @Pfannkuchensack on #9305:

- Add test_qwen_image_working_memory.py mirroring the z-image pattern, asserting both decode and encode paths call model_on_device with the estimated working_mem_bytes (regression guard for the OOM fix).
- Clarify the qwen estimator comment: the encode constant is not independently measured (half of decode, matching siblings' ratio) and should be recalibrated against a measured encode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(qwen): recalibrate VAE working-memory constants from a measured grid

Add scripts/calibrate_qwen_vae_working_memory.py, a backend-portable (CUDA/ROCm) harness that measures peak reserved-memory growth for VAE decode/encode across a resolution grid, one fresh subprocess per point.

Calibrating on an AMD W7900 (fp16) showed the encode constant was wrong: the previous 2750 ("half of decode") under-estimated by ~2x at every measured resolution, the exact OOM mode Qwen Image Edit (which encodes a real image) would hit. Raise encode 2750 -> 6300. Decode 5500 is confirmed safe across the full 512^2..2048^2 range and left unchanged.

The grid also showed memory is super-linear in area above ~1792^2 (an attention term) and non-monotonic (likely an SDPA-backend crossover on ROCm); both documented in the estimator. Constants are the conservative ROCm side and will be max-merged with a pending NVIDIA/CUDA run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(qwen): branch VAE working-memory constants by backend (ROCm vs CUDA)

Calibrating the same fp16 grid on an NVIDIA card showed CUDA reserves ~2x (decode) to ~4x (encode) less than ROCm: the Qwen VAE is attention-heavy, and CUDA's Flash/efficient attention is O(area) and flat while the ROCm math-attention fallback is O(area^2). The backends diverge far more than any headroom, so a single constant either under-estimates on ROCm (OOM) or massively over-budgets CUDA (needless eviction).

Select constants via torch.version.hip:

    decode: ROCm 5500 / CUDA 2900
    encode: ROCm 6300 / CUDA 1600

Each verified to cover its measured grid (19 points/backend) with ~8% headroom. The CUDA run also confirms the linear model holds with Flash attention (the ROCm super-linear/non-monotonic behavior is a math-attention artifact), and that "encode is half of decode" is CUDA-only.

Add parametrized tests asserting the constant selected for each (operation, backend) so a refactor can't silently swap them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(backend): ruff

* calibrate: support single-file Qwen Image VAE checkpoints

The calibration script only loaded the Qwen VAE from a diffusers directory via from_pretrained, so passing a single .safetensors file failed. Add _load_vae, which loads a directory as before and handles a single-file checkpoint by loading the state dict directly: a strict load for the diffusers layout, falling back to convert_wan_vae_to_diffusers for the original Qwen-Image/Wan release layout (downsamples/residual/time_conv keys) before retrying.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
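For readers who want a concrete picture of the backend branch described in the commit message, here is a minimal sketch. It is not the code from the PR: the function name `estimate_working_mem_bytes`, the helper `detect_backend`, and the table `QWEN_VAE_CONSTANTS` are hypothetical, and the formula `height * width * element_size * constant` is an assumption based on the "linear model" wording, since the commit message does not show the body of `estimate_vae_working_memory_qwen_image()`. Only the four constants and the use of `torch.version.hip` come from the text above.

```python
from typing import Literal, Optional

Operation = Literal["decode", "encode"]
Backend = Literal["rocm", "cuda"]

# Per-(operation, backend) constants quoted in the commit message.
QWEN_VAE_CONSTANTS = {
    ("decode", "rocm"): 5500,
    ("decode", "cuda"): 2900,
    ("encode", "rocm"): 6300,
    ("encode", "cuda"): 1600,
}


def detect_backend() -> Backend:
    """Return "rocm" when torch reports a HIP build, otherwise "cuda".

    torch.version.hip is None on non-ROCm builds. Falling back to "cuda"
    when torch is not installed is an assumption made for this sketch only.
    """
    try:
        import torch
    except ImportError:
        return "cuda"
    return "rocm" if getattr(torch.version, "hip", None) else "cuda"


def estimate_working_mem_bytes(
    height: int,
    width: int,
    element_size: int,
    operation: Operation,
    backend: Optional[Backend] = None,
) -> int:
    """Hypothetical linear estimate: pixel area * bytes per element * constant."""
    if backend is None:
        backend = detect_backend()
    constant = QWEN_VAE_CONSTANTS[(operation, backend)]
    return height * width * element_size * constant


if __name__ == "__main__":
    # fp16 has an element size of 2 bytes.
    for op in ("decode", "encode"):
        for be in ("rocm", "cuda"):
            est = estimate_working_mem_bytes(1024, 1024, 2, op, be)
            print(f"{op:6s} {be:4s} {est / 2**30:.2f} GiB")
```

In the PR, the resulting estimate is what gets passed as `working_mem_bytes` to `model_on_device()`, so the cache can evict other models before the VAE runs. Keeping the constants in one keyed table mirrors the intent of the parametrized tests: each (operation, backend) pair is looked up explicitly, so a swapped constant shows up as a failed assertion.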
Summary
This is the working branch for v6.13.5. Do not merge until after the final release.
Related Issues / Discussions
QA Instructions
Merge Plan
Checklist
What's New copy (if doing a release after this PR)