From b70f81f5d849df61416af88b8369ed24920491db Mon Sep 17 00:00:00 2001
From: Lincoln Stein <lincoln.stein@gmail.com>
Date: Mon, 29 Jun 2026 20:46:41 -0400
Subject: [PATCH 1/5] chore(release): bump version to v6.13.5.rc1

---
 invokeai/version/invokeai_version.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/invokeai/version/invokeai_version.py b/invokeai/version/invokeai_version.py
index 4007688a098..1d8d95d35c6 100644
--- a/invokeai/version/invokeai_version.py
+++ b/invokeai/version/invokeai_version.py
@@ -1 +1 @@
-__version__ = "6.13.0.post1"
+__version__ = "6.13.5.rc1"

From fd581476c64d3ddc85250c19eccb69fdb94fb31e Mon Sep 17 00:00:00 2001
From: Lincoln Stein <lincoln.stein@gmail.com>
Date: Tue, 30 Jun 2026 22:22:47 -0400
Subject: [PATCH 2/5] fix(qwen): estimate Qwen Image VAE working memory so the
 cache frees room before decode/encode (#9305)

* fix(qwen): estimate VAE working memory so the cache frees room before decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a
`working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache
therefore only reserved the default `device_working_mem_gb` and never
evicted the resident transformer/text encoder before the VAE decode. On a
near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder
resident) the decode then OOMs trying to allocate its working set into the
fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the
decode and encode paths so the cache makes room (evicting other models when
needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at
1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied
constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved*
(not just allocated) memory so that whenever the cache declines to free room
(free >= estimate) the decode is still guaranteed to fit. Encode uses ~half,
matching the other estimators (not independently measured).
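
As a rough illustration of the arithmetic above, the sketch below applies the linear model (`h * w * element_size * scaling_constant`) to the 1248x832 calibration point. The helper names `estimate_bytes` and `implied_constant` are hypothetical and exist only for this example, and the fp16 element size of 2 is an assumption; only the formula shape and the constant 5500 come from the patch.

```python
GIB = 1024**3


def estimate_bytes(h: int, w: int, element_size: int, scaling_constant: int) -> int:
    """Linear working-memory model: bytes = pixel area * element size * constant."""
    return h * w * element_size * scaling_constant


def implied_constant(reserved_bytes: int, h: int, w: int, element_size: int) -> float:
    """Invert the model: the constant that a measured reserved-memory growth implies."""
    return reserved_bytes / (h * w * element_size)


# 1248x832 decode, assuming fp16 weights (element_size == 2) and the rounded-up constant.
budget = estimate_bytes(832, 1248, 2, 5500)
print(f"budget: {budget} bytes ({budget / GIB:.2f} GiB)")

# Feeding the budget back through the inverse recovers the constant.
print(implied_constant(budget, 832, 1248, 2))
```

Because the cache compares free memory against this estimate, rounding the constant up only costs some extra eviction, while rounding it down risks the OOM this change is meant to prevent.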

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(qwen): cover VAE working-memory estimate is passed to cache

Address review feedback from @Pfannkuchensack on #9305:
- Add test_qwen_image_working_memory.py mirroring the z-image pattern,
  asserting both decode and encode paths call model_on_device with the
  estimated working_mem_bytes (regression guard for the OOM fix).
- Clarify the qwen estimator comment: the encode constant is not
  independently measured (half of decode, matching siblings' ratio) and
  should be recalibrated against a measured encode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(qwen): recalibrate VAE working-memory constants from a measured grid

Add scripts/calibrate_qwen_vae_working_memory.py, a backend-portable
(CUDA/ROCm) harness that measures peak reserved-memory growth for VAE
decode/encode across a resolution grid, one fresh subprocess per point.

Calibrating on an AMD W7900 (fp16) showed the encode constant was wrong:
the previous 2750 ("half of decode") under-estimated by ~2x at every
measured resolution, the exact OOM mode Qwen Image Edit (which encodes a
real image) would hit. Raise encode 2750 -> 6300. Decode 5500 is confirmed
safe across the full 512^2..2048^2 range and left unchanged.

The grid also showed memory is super-linear in area above ~1792^2 (an
attention term) and non-monotonic (likely an SDPA-backend crossover on
ROCm); both documented in the estimator. Constants are the conservative
ROCm side and will be max-merged with a pending NVIDIA/CUDA run.
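
The one-fresh-subprocess-per-point idea behind the harness can be sketched roughly as follows. This is a minimal stand-in, not the script itself: the child here only reports the pixel area as a JSON line, where the real harness would run the VAE and report reserved-memory growth. `CHILD_CODE` and `measure_point` are hypothetical names used for illustration.

```python
import json
import subprocess
import sys
from typing import Optional

# Stand-in child program (assumption): prints one JSON line for the requested point.
CHILD_CODE = """
import json, sys
h, w = int(sys.argv[1]), int(sys.argv[2])
print(json.dumps({"h": h, "w": w, "area": h * w}))
"""


def measure_point(h: int, w: int) -> Optional[dict]:
    """Run one grid point in a fresh interpreter and parse its last stdout line as JSON."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(h), str(w)],
        capture_output=True,
        text=True,
    )
    lines = proc.stdout.strip().splitlines()
    if not lines:
        return None  # child crashed or printed nothing; the caller records a failure
    try:
        return json.loads(lines[-1])
    except json.JSONDecodeError:
        return None


rows = [measure_point(h, w) for h, w in [(512, 512), (768, 768)]]
print(rows)
```

Isolating each point in its own process means allocator state left behind by one resolution cannot leak into the reading for the next, and a child that dies simply yields a missing row instead of aborting the grid.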

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(qwen): branch VAE working-memory constants by backend (ROCm vs CUDA)

Calibrating the same fp16 grid on an NVIDIA card showed CUDA reserves
~2x (decode) to ~4x (encode) less than ROCm: the Qwen VAE is attention-
heavy, and CUDA's Flash/efficient attention is O(area) and flat while the
ROCm math-attention fallback is O(area^2). The backends diverge far more
than any headroom, so a single constant either under-estimates on ROCm
(OOM) or massively over-budgets CUDA (needless eviction).

Select constants via torch.version.hip:
  decode: ROCm 5500 / CUDA 2900
  encode: ROCm 6300 / CUDA 1600
Each verified to cover its measured grid (19 points/backend) with ~8%
headroom. The CUDA run also confirms the linear model holds with Flash
attention (the ROCm super-linear/non-monotonic behavior is a math-
attention artifact), and that "encode is half of decode" is CUDA-only.

Add parametrized tests asserting the constant selected for each
(operation, backend) so a refactor can't silently swap them.
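
The selection described above can be sketched as a small lookup. This is a table-driven variant written for illustration, not the if/else the patch ships; `SCALING_CONSTANTS` and `select_constant` are hypothetical names, and the `hip_version` argument stands in for `torch.version.hip` so the example runs without PyTorch.

```python
from typing import Optional

# Constants quoted in the message above, keyed by (operation, is_rocm).
SCALING_CONSTANTS = {
    ("decode", True): 5500,
    ("decode", False): 2900,
    ("encode", True): 6300,
    ("encode", False): 1600,
}


def select_constant(operation: str, hip_version: Optional[str]) -> int:
    """Pick the scaling constant; a non-None HIP version is treated as a ROCm build."""
    if operation not in ("decode", "encode"):
        raise ValueError(f"unknown operation: {operation!r}")
    is_rocm = hip_version is not None
    return SCALING_CONSTANTS[(operation, is_rocm)]


for op in ("decode", "encode"):
    print(op, select_constant(op, "7.1.0"), select_constant(op, None))
```

Keeping all four values in one table makes it easy for a parametrized test to assert each (operation, backend) pair separately.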

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(backend): ruff

* calibrate: support single-file Qwen Image VAE checkpoints

The calibration script only loaded the Qwen VAE from a diffusers
directory via from_pretrained, so passing a single .safetensors file
failed. Add _load_vae, which loads a directory as before and handles a
single-file checkpoint by loading the state dict directly: a strict load
for the diffusers layout, falling back to convert_wan_vae_to_diffusers
for the original Qwen-Image/Wan release layout (downsamples/residual/
time_conv keys) before retrying.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
---
 .../qwen_image_image_to_latents.py            |   9 +-
 .../qwen_image_latents_to_image.py            |   8 +-
 invokeai/backend/util/vae_working_memory.py   |  52 +++
 scripts/calibrate_qwen_vae_working_memory.py  | 305 ++++++++++++++++++
 .../test_qwen_image_working_memory.py         | 136 ++++++++
 5 files changed, 508 insertions(+), 2 deletions(-)
 create mode 100644 scripts/calibrate_qwen_vae_working_memory.py
 create mode 100644 tests/app/invocations/test_qwen_image_working_memory.py

diff --git a/invokeai/app/invocations/qwen_image_image_to_latents.py b/invokeai/app/invocations/qwen_image_image_to_latents.py
index ef88e03082b..ffae5470f68 100644
--- a/invokeai/app/invocations/qwen_image_image_to_latents.py
+++ b/invokeai/app/invocations/qwen_image_image_to_latents.py
@@ -18,6 +18,7 @@
 from invokeai.backend.model_manager.load.load_base import LoadedModel
 from invokeai.backend.stable_diffusion.diffusers_pipeline import image_resized_to_grid_as_tensor
 from invokeai.backend.util.devices import TorchDevice
+from invokeai.backend.util.vae_working_memory import estimate_vae_working_memory_qwen_image
 
 
 @invocation(
@@ -44,7 +45,13 @@ class QwenImageImageToLatentsInvocation(BaseInvocation, WithMetadata, WithBoard)
 
     @staticmethod
     def vae_encode(vae_info: LoadedModel, image_tensor: torch.Tensor) -> torch.Tensor:
-        with vae_info.model_on_device() as (_, vae):
+        assert isinstance(vae_info.model, AutoencoderKLQwenImage)
+        estimated_working_memory = estimate_vae_working_memory_qwen_image(
+            operation="encode",
+            image_tensor=image_tensor,
+            vae=vae_info.model,
+        )
+        with vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
             assert isinstance(vae, AutoencoderKLQwenImage)
 
             vae.disable_tiling()
diff --git a/invokeai/app/invocations/qwen_image_latents_to_image.py b/invokeai/app/invocations/qwen_image_latents_to_image.py
index b3ea39c4bbf..072185f147b 100644
--- a/invokeai/app/invocations/qwen_image_latents_to_image.py
+++ b/invokeai/app/invocations/qwen_image_latents_to_image.py
@@ -19,6 +19,7 @@
 from invokeai.app.services.shared.invocation_context import InvocationContext
 from invokeai.backend.stable_diffusion.extensions.seamless import SeamlessExt
 from invokeai.backend.util.devices import TorchDevice
+from invokeai.backend.util.vae_working_memory import estimate_vae_working_memory_qwen_image
 
 
 @invocation(
@@ -41,9 +42,14 @@ def invoke(self, context: InvocationContext) -> ImageOutput:
 
         vae_info = context.models.load(self.vae.vae)
         assert isinstance(vae_info.model, AutoencoderKLQwenImage)
+        estimated_working_memory = estimate_vae_working_memory_qwen_image(
+            operation="decode",
+            image_tensor=latents,
+            vae=vae_info.model,
+        )
         with (
             SeamlessExt.static_patch_model(vae_info.model, self.vae.seamless_axes),
-            vae_info.model_on_device() as (_, vae),
+            vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae),
         ):
             context.util.signal_progress("Running VAE")
             assert isinstance(vae, AutoencoderKLQwenImage)
diff --git a/invokeai/backend/util/vae_working_memory.py b/invokeai/backend/util/vae_working_memory.py
index f9228ced652..57d558c03b4 100644
--- a/invokeai/backend/util/vae_working_memory.py
+++ b/invokeai/backend/util/vae_working_memory.py
@@ -2,6 +2,7 @@
 
 import torch
 from diffusers.models.autoencoders.autoencoder_kl import AutoencoderKL
+from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage
 from diffusers.models.autoencoders.autoencoder_tiny import AutoencoderTiny
 
 from invokeai.app.invocations.constants import LATENT_SCALE_FACTOR
@@ -92,6 +93,57 @@ def estimate_vae_working_memory_flux(
     return int(working_memory)
 
 
+def estimate_vae_working_memory_qwen_image(
+    operation: Literal["encode", "decode"], image_tensor: torch.Tensor, vae: AutoencoderKLQwenImage
+) -> int:
+    """Estimate the working memory required by the invocation in bytes.
+
+    The Qwen Image VAE is a video-style autoencoder that operates on 5D tensors of shape
+    (B, C, num_frames, H, W). Tiling is not used, so peak working memory scales with the full
+    spatial output. The two trailing dimensions are the spatial H/W in latent space (decode) or
+    pixel space (encode), matching the convention used by the other estimators here.
+    """
+    latent_scale_factor_for_operation = LATENT_SCALE_FACTOR if operation == "decode" else 1
+
+    h = latent_scale_factor_for_operation * image_tensor.shape[-2]
+    w = latent_scale_factor_for_operation * image_tensor.shape[-1]
+    element_size = next(vae.parameters()).element_size()
+
+    # The Qwen Image VAE is much heavier than the SD/SDXL VAE and needs correspondingly larger
+    # constants. These were calibrated by measuring peak *reserved* memory growth (not just allocated
+    # -- reserved is what the cache's `free >= estimate` check compares against) across a resolution
+    # grid in fp16, on both an AMD W7900 (ROCm) and an NVIDIA card (CUDA). See
+    # scripts/calibrate_qwen_vae_working_memory.py.
+    #
+    # Implied constant = reserved_bytes / (h * w * element_size). Per-point maxima (fp16):
+    #              512^2  768^2  1024^2  1536^2  1792^2  2048^2    -> ship (max observed + ~8% headroom)
+    #   ROCm decode  5132   4596   4570    3273    3735    4813    -> 5500
+    #   ROCm encode  5864   5858   5858    3532    4364   (OOM)    -> 6300
+    #   CUDA decode  2660   2519   2690    2671    2281   (OOM)    -> 2900
+    #   CUDA encode  1456   1451   1458    1456    1455    1455    -> 1600
+    #
+    # Why this branches on backend (the only estimator here that does):
+    #  - The Qwen VAE is attention-heavy. With Flash/efficient attention (CUDA) the attention memory
+    #    is O(area) and the curve is flat/linear; the ROCm build falls back to math attention, which
+    #    is O(area^2), so ROCm reserves ~2x (decode) to ~4x (encode) more and goes super-linear above
+    #    ~1792^2. The two backends differ far more than any headroom, so a single constant would
+    #    either under-estimate on ROCm (OOM) or massively over-budget on CUDA (needless eviction).
+    #  - "Encoding is half of decoding" (as the sibling estimators assume) is only true on CUDA. On
+    #    ROCm encode reserves >= decode, so the ROCm encode constant is sized accordingly -- this is
+    #    the path Qwen Image Edit exercises.
+    #  - On ROCm the linear model under-estimates for decodes well above 2048^2, but those OOM on a
+    #    48GB card regardless; on CUDA the curve stays linear so no extra term is needed.
+    is_rocm = torch.version.hip is not None
+    if operation == "decode":
+        scaling_constant = 5500 if is_rocm else 2900
+    else:  # encode
+        scaling_constant = 6300 if is_rocm else 1600
+
+    working_memory = h * w * element_size * scaling_constant
+
+    return int(working_memory)
+
+
 def estimate_vae_working_memory_sd3(
     operation: Literal["encode", "decode"], image_tensor: torch.Tensor, vae: AutoencoderKL
 ) -> int:
diff --git a/scripts/calibrate_qwen_vae_working_memory.py b/scripts/calibrate_qwen_vae_working_memory.py
new file mode 100644
index 00000000000..2f81752aece
--- /dev/null
+++ b/scripts/calibrate_qwen_vae_working_memory.py
@@ -0,0 +1,305 @@
+"""Calibrate the Qwen Image VAE working-memory estimate against measured peak CUDA/HIP memory.
+
+Background
+----------
+``estimate_vae_working_memory_qwen_image`` models peak working memory as a linear function of
+spatial area::
+
+    working_memory = h * w * element_size * scaling_constant
+
+This script measures the *actual* peak reserved memory the VAE consumes during decode/encode across
+a grid of resolutions so the ``scaling_constant`` can be fit from several points instead of one, and
+so we can check whether the pure-linear model holds or whether a super-linear (attention) term
+appears at high resolution.
+
+The estimate is consumed by the model cache via ``free >= estimate`` to decide whether to evict, so
+it MUST be an upper bound: we measure peak *reserved* (not just allocated) memory, the conservative
+quantity that includes caching-allocator overhead and kernel scratch/workspace.
+
+Portability
+-----------
+Backend-agnostic: uses only ``torch.cuda.*``, which works on both NVIDIA/CUDA and AMD/ROCm (HIP)
+builds of PyTorch. Run the SAME script on each backend and compare the ``implied_constant`` columns.
+Neither the curve shape nor the absolute constant is guaranteed to match (cuDNN vs. MIOpen conv
+workspaces, flash-attention availability, allocator rounding), so the estimator keeps per-backend
+constants: take the ``max`` over each backend's own grid, plus headroom.
+
+Each (operation, resolution) point is measured in a FRESH SUBPROCESS so the caching allocator's
+fragmentation history from earlier points cannot contaminate the reserved-delta reading. A point
+that OOMs is recorded as ``oom`` rather than aborting the run, so the grid can probe up to the
+card's ceiling safely.
+
+Usage
+-----
+    python scripts/calibrate_qwen_vae_working_memory.py [--vae /path/to/vae_dir_or_file] [--dtype float16] [--csv out.csv]
+
+If ``--vae`` is omitted, the script auto-discovers an ``AutoencoderKLQwenImage`` under
+``$INVOKEAI_ROOT/models``.
+"""
+
+import argparse
+import json
+import os
+import subprocess
+import sys
+from pathlib import Path
+
+import torch
+from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage
+
+LATENT_SCALE_FACTOR = 8
+
+# (height, width) pixel-space resolutions. Squares to test linearity in area, plus non-square
+# points (incl. the original 1248x832 calibration point) to confirm area = h*w is the right
+# predictor rather than max(h, w) or perimeter. Subprocess isolation + OOM capture means we can
+# list aggressive resolutions; ones that don't fit are simply recorded as oom.
+DEFAULT_RESOLUTIONS = [
+    (512, 512),
+    (768, 768),
+    (832, 1248),  # original single calibration point (as HxW)
+    (1024, 1024),
+    (1088, 1920),
+    (1280, 1280),
+    (1536, 1024),
+    (1536, 1536),
+    (1792, 1792),
+    (2048, 2048),
+]
+
+
+def discover_vae() -> Path:
+    """Find an AutoencoderKLQwenImage VAE directory under $INVOKEAI_ROOT/models."""
+    root = os.environ.get("INVOKEAI_ROOT")
+    if not root:
+        raise SystemExit("INVOKEAI_ROOT not set; pass --vae explicitly.")
+    models = Path(root) / "models"
+    for config_path in models.glob("*/vae/config.json"):
+        try:
+            cfg = json.loads(config_path.read_text())
+        except Exception:
+            continue
+        if cfg.get("_class_name") == "AutoencoderKLQwenImage":
+            return config_path.parent
+    raise SystemExit(f"No AutoencoderKLQwenImage VAE found under {models}; pass --vae explicitly.")
+
+
+DTYPES = {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32}
+
+
+def _load_vae(vae_path: str, dtype: torch.dtype) -> AutoencoderKLQwenImage:
+    """Load an AutoencoderKLQwenImage from either a diffusers directory or a single .safetensors file.
+
+    Directory: standard ``from_pretrained``.
+    Single file: ``AutoencoderKLQwenImage`` has no single-file converter registered in diffusers,
+    so we instantiate with the default config and load the state dict directly. Two on-disk layouts
+    exist: the diffusers layout (``encoder.conv_in`` / ``down_blocks`` / ``mid_block`` keys, e.g. the
+    weights InvokeAI's VAELoader consumes) and the original Qwen-Image/Wan release layout
+    (``encoder.conv1`` / ``downsamples`` / ``middle`` / ``time_conv`` keys). We try a direct strict
+    load first, and on a key mismatch fall back to diffusers' Wan VAE converter -- the Qwen-Image VAE
+    shares the Wan VAE key structure -- before retrying.
+    """
+    path = Path(vae_path)
+    if not path.is_file():
+        return AutoencoderKLQwenImage.from_pretrained(vae_path, local_files_only=True, torch_dtype=dtype)
+
+    from safetensors.torch import load_file
+
+    sd = load_file(str(path))
+    for k in list(sd.keys()):
+        if sd[k].is_floating_point():
+            sd[k] = sd[k].to(dtype)
+
+    vae = AutoencoderKLQwenImage()
+    try:
+        # diffusers-layout checkpoint: keys already match the model. State dict was converted to
+        # `dtype` above and is assigned in place, so params carry the correct dtype.
+        vae.load_state_dict(sd, strict=True, assign=True)
+    except RuntimeError:
+        # Original Qwen-Image/Wan release layout: convert keys to the diffusers layout, then retry.
+        from diffusers.loaders.single_file_utils import convert_wan_vae_to_diffusers
+
+        converted = convert_wan_vae_to_diffusers(sd)
+        for k in list(converted.keys()):
+            if converted[k].is_floating_point():
+                converted[k] = converted[k].to(dtype)
+        vae.load_state_dict(converted, strict=True, assign=True)
+    return vae
+
+
+def _build_input(operation: str, h: int, w: int, z_dim: int, dtype: torch.dtype) -> torch.Tensor:
+    """Construct the 5D (B, C, num_frames, H, W) input the invocation feeds the VAE.
+
+    decode: latents at latent resolution (H/8, W/8) with z_dim channels.
+    encode: image at pixel resolution (H, W) with 3 channels.
+    These mirror QwenImageLatentsToImageInvocation / QwenImageImageToLatentsInvocation exactly.
+    """
+    device = torch.device("cuda")
+    if operation == "decode":
+        return torch.randn(1, z_dim, 1, h // LATENT_SCALE_FACTOR, w // LATENT_SCALE_FACTOR, device=device, dtype=dtype)
+    return torch.randn(1, 3, 1, h, w, device=device, dtype=dtype)
+
+
+@torch.inference_mode()
+def measure_one(vae_path: str, operation: str, h: int, w: int, dtype: torch.dtype) -> dict:
+    """Measure peak reserved-memory growth for a single decode/encode. Runs in a child process."""
+    vae = _load_vae(vae_path, dtype)
+    vae.to("cuda")
+    vae.disable_tiling()  # Qwen invocations never tile; match that.
+
+    param = next(vae.parameters())
+    dtype = param.dtype
+    element_size = param.element_size()
+    z_dim = int(vae.config.z_dim)
+
+    x = _build_input(operation, h, w, z_dim, dtype)
+
+    torch.cuda.synchronize()
+    torch.cuda.empty_cache()
+    torch.cuda.reset_peak_memory_stats()
+    baseline_reserved = torch.cuda.memory_reserved()
+
+    # Measure the COLD first call -- it includes conv-algorithm-search / attention workspace
+    # allocation, which is exactly what the real (single-shot) invocation pays.
+    try:
+        if operation == "decode":
+            vae.decode(x, return_dict=False)
+        else:
+            vae.encode(x).latent_dist.mode()
+        torch.cuda.synchronize()
+    except (torch.cuda.OutOfMemoryError, RuntimeError) as e:
+        if "out of memory" not in str(e).lower():
+            raise
+        return {"operation": operation, "h": h, "w": w, "oom": True}
+
+    peak_reserved = torch.cuda.max_memory_reserved()
+    peak_allocated = torch.cuda.max_memory_allocated()
+    reserved_delta = peak_reserved - baseline_reserved
+
+    area = h * w
+    return {
+        "operation": operation,
+        "h": h,
+        "w": w,
+        "area": area,
+        "element_size": element_size,
+        "dtype": str(dtype),
+        "reserved_delta": reserved_delta,
+        "allocated_peak": peak_allocated,
+        "reserved_baseline": baseline_reserved,
+        # The constant as the estimator parameterizes it: mem = area * element_size * k
+        "implied_constant": reserved_delta / (area * element_size),
+        "oom": False,
+    }
+
+
+def run_grid(vae_path: str, resolutions: list[tuple[int, int]], dtype_name: str, csv_path: Path | None) -> None:
+    rows: list[dict] = []
+    print(f"VAE: {vae_path}")
+    print(
+        f"torch {torch.__version__} | device {torch.cuda.get_device_name(0)} | hip={torch.version.hip} | dtype={dtype_name}\n"
+    )
+    print(f"{'op':6} {'HxW':>11} {'area':>10} {'reserved(GiB)':>14} {'alloc(GiB)':>11} {'implied_k':>10}")
+    print("-" * 70)
+
+    for operation in ("decode", "encode"):
+        for h, w in resolutions:
+            # Fresh subprocess per point for an uncontaminated reserved-memory reading.
+            proc = subprocess.run(
+                [
+                    sys.executable,
+                    __file__,
+                    "--single",
+                    operation,
+                    str(h),
+                    str(w),
+                    "--vae",
+                    vae_path,
+                    "--dtype",
+                    dtype_name,
+                ],
+                capture_output=True,
+                text=True,
+            )
+            line = proc.stdout.strip().splitlines()[-1] if proc.stdout.strip() else ""
+            try:
+                row = json.loads(line)
+            except Exception:
+                print(f"{operation:6} {f'{h}x{w}':>11}  FAILED: {proc.stderr.strip().splitlines()[-1:]}")
+                continue
+            rows.append(row)
+            if row.get("oom"):
+                print(f"{operation:6} {f'{h}x{w}':>11} {h * w:>10}  {'OOM':>14}")
+                continue
+            gib = 1024**3
+            print(
+                f"{operation:6} {f'{h}x{w}':>11} {row['area']:>10} "
+                f"{row['reserved_delta'] / gib:>14.3f} {row['allocated_peak'] / gib:>11.3f} "
+                f"{row['implied_constant']:>10.1f}"
+            )
+
+    # Summary: the shippable constant is the MAX implied constant over fitting points (upper bound).
+    print("\n=== summary (max implied constant = candidate scaling_constant, before headroom) ===")
+    for operation in ("decode", "encode"):
+        ks = [r["implied_constant"] for r in rows if r["operation"] == operation and not r.get("oom")]
+        if ks:
+            print(
+                f"{operation:6}: n={len(ks)}  min_k={min(ks):.1f}  max_k={max(ks):.1f}  "
+                f"-> use >= {max(ks):.0f} (+headroom)"
+            )
+
+    if csv_path:
+        import csv
+
+        fieldnames = [
+            "operation",
+            "h",
+            "w",
+            "area",
+            "element_size",
+            "dtype",
+            "reserved_delta",
+            "allocated_peak",
+            "reserved_baseline",
+            "implied_constant",
+            "oom",
+        ]
+        with csv_path.open("w", newline="") as f:
+            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
+            writer.writeheader()
+            for r in rows:
+                writer.writerow(r)
+        print(f"\nWrote {csv_path}")
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument(
+        "--vae",
+        type=str,
+        default=None,
+        help="Path to an AutoencoderKLQwenImage diffusers dir OR a single .safetensors checkpoint.",
+    )
+    parser.add_argument("--csv", type=str, default=None, help="Optional path to write the raw results as CSV.")
+    parser.add_argument(
+        "--dtype",
+        choices=list(DTYPES),
+        default="float16",
+        help="Compute dtype. Default float16 to match InvokeAI's default precision on CUDA/ROCm.",
+    )
+    # Internal: measure a single point in this process and print one JSON line.
+    parser.add_argument("--single", nargs=3, metavar=("OP", "H", "W"), default=None, help=argparse.SUPPRESS)
+    args = parser.parse_args()
+
+    vae_path = args.vae or str(discover_vae())
+    dtype = DTYPES[args.dtype]
+
+    if args.single:
+        op, h, w = args.single[0], int(args.single[1]), int(args.single[2])
+        print(json.dumps(measure_one(vae_path, op, h, w, dtype)))
+        return
+
+    run_grid(vae_path, DEFAULT_RESOLUTIONS, args.dtype, Path(args.csv) if args.csv else None)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/app/invocations/test_qwen_image_working_memory.py b/tests/app/invocations/test_qwen_image_working_memory.py
new file mode 100644
index 00000000000..f3dfbe970df
--- /dev/null
+++ b/tests/app/invocations/test_qwen_image_working_memory.py
@@ -0,0 +1,136 @@
+"""Test that Qwen Image VAE invocations properly estimate and request working memory."""
+
+from contextlib import nullcontext
+from unittest.mock import MagicMock, patch
+
+import pytest
+import torch
+from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage
+
+from invokeai.app.invocations.qwen_image_image_to_latents import QwenImageImageToLatentsInvocation
+from invokeai.app.invocations.qwen_image_latents_to_image import QwenImageLatentsToImageInvocation
+from invokeai.backend.util.vae_working_memory import estimate_vae_working_memory_qwen_image
+
+
+class TestQwenImageWorkingMemoryEstimate:
+    """Lock in the per-backend scaling constants calibrated in scripts/calibrate_qwen_vae_working_memory.py.
+
+    These differ by backend because the Qwen VAE is attention-heavy: ROCm falls back to math attention
+    (O(area^2), much higher memory) while CUDA uses Flash/efficient attention. A regression that swaps
+    the constants would reintroduce the ROCm OOM (under-estimate) or needlessly over-budget CUDA.
+    """
+
+    # (operation, is_rocm) -> expected constant. The estimator scales pixel area (latent * 8 for
+    # decode, raw for encode) by element_size and the constant.
+    @pytest.mark.parametrize(
+        "operation, is_rocm, expected_constant",
+        [
+            ("decode", True, 5500),
+            ("decode", False, 2900),
+            ("encode", True, 6300),
+            ("encode", False, 1600),
+        ],
+    )
+    def test_constant_selected_per_backend(self, operation, is_rocm, expected_constant):
+        mock_vae = MagicMock(spec=AutoencoderKLQwenImage)
+        mock_vae.parameters.return_value = iter([torch.zeros(1, dtype=torch.float16)])  # element_size == 2
+
+        # decode receives latents (pixel area = latent area * 8^2); encode receives a pixel image.
+        if operation == "decode":
+            image_tensor = torch.zeros(1, 16, 1, 64, 64)
+            h = w = 64 * 8
+        else:
+            image_tensor = torch.zeros(1, 3, 1, 512, 512)
+            h = w = 512
+
+        hip_value = "7.1.0" if is_rocm else None
+        with patch("torch.version.hip", hip_value):
+            result = estimate_vae_working_memory_qwen_image(
+                operation=operation, image_tensor=image_tensor, vae=mock_vae
+            )
+
+        assert result == h * w * 2 * expected_constant
+
+
+class TestQwenImageWorkingMemory:
+    """Test that Qwen Image VAE invocations request working memory before decode/encode."""
+
+    def _mock_vae_info(self):
+        """Build a mocked AutoencoderKLQwenImage and its LoadedModel wrapper."""
+        mock_vae = MagicMock(spec=AutoencoderKLQwenImage)
+
+        # Create mock parameter for dtype detection
+        mock_param = torch.zeros(1)
+        mock_vae.parameters.return_value = iter([mock_param])
+
+        # Create mock vae_info with a model_on_device context manager yielding (None, vae)
+        mock_vae_info = MagicMock()
+        mock_vae_info.model = mock_vae
+
+        mock_cm = MagicMock()
+        mock_cm.__enter__ = MagicMock(return_value=(None, mock_vae))
+        mock_cm.__exit__ = MagicMock(return_value=None)
+        mock_vae_info.model_on_device = MagicMock(return_value=mock_cm)
+
+        return mock_vae, mock_vae_info
+
+    def test_qwen_latents_to_image_requests_working_memory(self):
+        """QwenImageLatentsToImageInvocation estimates decode memory and passes it to the cache."""
+        mock_vae, mock_vae_info = self._mock_vae_info()
+
+        # Mock the context
+        mock_context = MagicMock()
+        mock_context.models.load.return_value = mock_vae_info
+
+        # Mock latents (5D: B, C, num_frames, H, W)
+        mock_latents = torch.zeros(1, 16, 1, 64, 64)
+        mock_context.tensors.load.return_value = mock_latents
+
+        estimation_path = "invokeai.app.invocations.qwen_image_latents_to_image.estimate_vae_working_memory_qwen_image"
+        seamless_path = "invokeai.app.invocations.qwen_image_latents_to_image.SeamlessExt.static_patch_model"
+
+        with (
+            patch(estimation_path) as mock_estimate,
+            patch(seamless_path, return_value=nullcontext()),
+        ):
+            expected_memory = 1024 * 1024 * 10000  # 10000 MiB (~10 GB)
+            mock_estimate.return_value = expected_memory
+
+            invocation = QwenImageLatentsToImageInvocation.model_construct(
+                latents=MagicMock(latents_name="test_latents"),
+                vae=MagicMock(vae=MagicMock(), seamless_axes=["x", "y"]),
+            )
+
+            try:
+                invocation.invoke(mock_context)
+            except Exception:
+                # Downstream decode math fails under mocking; we only care that the cache was
+                # asked to reserve the estimated working memory before entering the device context.
+                pass
+
+            mock_estimate.assert_called_once()
+            assert mock_estimate.call_args.kwargs["operation"] == "decode"
+            mock_vae_info.model_on_device.assert_called_once_with(working_mem_bytes=expected_memory)
+
+    def test_qwen_image_to_latents_requests_working_memory(self):
+        """QwenImageImageToLatentsInvocation estimates encode memory and passes it to the cache."""
+        mock_vae, mock_vae_info = self._mock_vae_info()
+
+        mock_image_tensor = torch.zeros(1, 3, 512, 512)
+
+        estimation_path = "invokeai.app.invocations.qwen_image_image_to_latents.estimate_vae_working_memory_qwen_image"
+
+        with patch(estimation_path) as mock_estimate:
+            expected_memory = 1024 * 1024 * 5000  # 5000 MiB (~5 GB)
+            mock_estimate.return_value = expected_memory
+
+            try:
+                QwenImageImageToLatentsInvocation.vae_encode(mock_vae_info, mock_image_tensor)
+            except Exception:
+                # Downstream encode math fails under mocking; we only care that the cache was
+                # asked to reserve the estimated working memory before entering the device context.
+                pass
+
+            mock_estimate.assert_called_once()
+            assert mock_estimate.call_args.kwargs["operation"] == "encode"
+            mock_vae_info.model_on_device.assert_called_once_with(working_mem_bytes=expected_memory)

From f8e9018778d49a4f60257ba494e5db6d4e23ebd3 Mon Sep 17 00:00:00 2001
From: Lincoln Stein <lincoln.stein@gmail.com>
Date: Sat, 4 Jul 2026 18:10:31 -0400
Subject: [PATCH 3/5] docs: add 3rd-party GPU hosting services (#9299)

* docs: add 3rd-party GPU hosting services

* Update docs/src/content/docs/index.mdx

Co-authored-by: Josh Corbett <joshwcorbett@icloud.com>

* docs: restyle hosted options as LinkCards in a wrapper card

Implements joshistoast's suggested design: a bordered "Hosted Options"
wrapper containing Starlight CardGrid/LinkCard entries, replacing the
text separator and hand-rolled cards.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
Co-authored-by: Josh Corbett <joshwcorbett@icloud.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
---
 docs/src/content/docs/index.mdx               |  4 +-
 docs/src/lib/components/DownloadOptions.astro | 51 ++++++++++++++++++-
 docs/src/styles/custom.css                    |  5 ++
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/docs/src/content/docs/index.mdx b/docs/src/content/docs/index.mdx
index 52de38f6faf..f9f8cc25c1b 100644
--- a/docs/src/content/docs/index.mdx
+++ b/docs/src/content/docs/index.mdx
@@ -118,8 +118,8 @@ Ready to unleash your creativity? Invoke is available for Windows, macOS, and Li
 
 ---
 
-:::note[About the Hosted Version]
-The Invoke hosted platform has been shut down as the founding team joined Adobe. However, Invoke lives on as a thriving open-source project maintained by the community.
+:::note[About the former officially hosted version]
+The Invoke.ai hosted platform has been shut down as the founding team joined Adobe. However, Invoke lives on as a thriving open-source project maintained by the community.
 
 The open-source version offers the same powerful features you may have used in the hosted service, with the added benefit of complete control and privacy through self-hosting.
 
diff --git a/docs/src/lib/components/DownloadOptions.astro b/docs/src/lib/components/DownloadOptions.astro
index dfae32a4915..95124322fd9 100644
--- a/docs/src/lib/components/DownloadOptions.astro
+++ b/docs/src/lib/components/DownloadOptions.astro
@@ -1,5 +1,5 @@
 ---
-import { LinkCard, Icon, LinkButton } from '@astrojs/starlight/components';
+import { CardGrid, LinkCard, Icon, LinkButton } from '@astrojs/starlight/components';
 import { type StarlightIcon } from '@astrojs/starlight/types';
 import { withBase } from '../base-path';
 
@@ -49,6 +49,24 @@ const manualDownloadOptions = {
     href: withBase('/configuration/docker/', import.meta.env.BASE_URL),
   },
 };
+
+const hostedOptions = {
+  aibadgr: {
+    headline: 'Run on AI Badgr',
+    description: 'Run on the AI Badgr hosted GPU service',
+    href: 'https://aibadgr.com/gpu/launch?template=invokeai',
+  },
+  runpod: {
+    headline: 'Run on RunPod',
+    description: 'Run on the RunPod hosted GPU service',
+    href: 'https://www.runpod.io/blog/invoke-ai-stable-diffusion-runpod-nfz18',
+  },
+  railway: {
+    headline: 'Run on Railway',
+    description: 'Run on the Railway hosted GPU service',
+    href: 'https://railway.com/deploy/invokeai',
+  },
+};
 ---
 
 <div class="download-options">
@@ -92,6 +110,20 @@ const manualDownloadOptions = {
       ))
     }
   </div>
+
+  <!-- Hosted Options -->
+  <div class="download-options__hosted">
+    <h3>Hosted Options</h3>
+    <p>For users who want to run Invoke on a hosted GPU service instead of their own hardware.</p>
+
+    <CardGrid>
+      {
+        Object.entries(hostedOptions).map(([key, { headline, href, description }]) => (
+          <LinkCard title={headline} {href} {description} />
+        ))
+      }
+    </CardGrid>
+  </div>
 </div>
 
 <style is:global>
@@ -134,6 +166,23 @@ const manualDownloadOptions = {
     }
   }
 
+  .download-options__hosted {
+    margin: 0;
+    border-radius: var(--radius);
+    background: var(--sl-color-black);
+    border: 1px solid var(--sl-color-gray-5);
+    padding: 1.5rem;
+
+    h3 {
+      margin-top: 0;
+    }
+
+    > p {
+      margin-top: 0.5rem;
+      color: var(--sl-color-gray-2);
+    }
+  }
+
   .download-options--launcher {
     display: grid;
     gap: 1.5rem;
diff --git a/docs/src/styles/custom.css b/docs/src/styles/custom.css
index abb93263edd..63a47b18b28 100644
--- a/docs/src/styles/custom.css
+++ b/docs/src/styles/custom.css
@@ -342,6 +342,11 @@ article.card {
 
 /* Splash Page-specific styles */
 
+/* Make the default body font a bit smaller on the splash/home page. */
+[data-has-hero] .sl-markdown-content {
+  font-size: 0.9rem;
+}
+
 @keyframes splash-animate {
   0% { background-position: 0% 0%; }
   50% { background-position: 0% 100%; }

From 89cf74d9bef7d6e4281be6058b517a0d13c7c69b Mon Sep 17 00:00:00 2001
From: Lincoln Stein <lincoln.stein@gmail.com>
Date: Sat, 4 Jul 2026 18:11:53 -0400
Subject: [PATCH 4/5] chore(release): bump version to 6.13.5

---
 invokeai/version/invokeai_version.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/invokeai/version/invokeai_version.py b/invokeai/version/invokeai_version.py
index 1d8d95d35c6..fd1442e1c37 100644
--- a/invokeai/version/invokeai_version.py
+++ b/invokeai/version/invokeai_version.py
@@ -1 +1,2 @@
-__version__ = "6.13.5.rc1"
+__version__ = "6.13.5"
+

From 9a3bd1d54cc048ede32c899731c7f12e6d773374 Mon Sep 17 00:00:00 2001
From: Lincoln Stein <lincoln.stein@gmail.com>
Date: Sat, 4 Jul 2026 18:24:41 -0400
Subject: [PATCH 5/5] chore(backend): ruff

---
 invokeai/version/invokeai_version.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/invokeai/version/invokeai_version.py b/invokeai/version/invokeai_version.py
index fd1442e1c37..7f3448c31a5 100644
--- a/invokeai/version/invokeai_version.py
+++ b/invokeai/version/invokeai_version.py
@@ -1,2 +1 @@
 __version__ = "6.13.5"
-