From b70f81f5d849df61416af88b8369ed24920491db Mon Sep 17 00:00:00 2001 From: Lincoln Stein Date: Mon, 29 Jun 2026 20:46:41 -0400 Subject: [PATCH 1/5] chore(release): bump version to v6.13.5.rc1 --- invokeai/version/invokeai_version.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/invokeai/version/invokeai_version.py b/invokeai/version/invokeai_version.py index 4007688a098..1d8d95d35c6 100644 --- a/invokeai/version/invokeai_version.py +++ b/invokeai/version/invokeai_version.py @@ -1 +1 @@ -__version__ = "6.13.0.post1" +__version__ = "6.13.5.rc1" From fd581476c64d3ddc85250c19eccb69fdb94fb31e Mon Sep 17 00:00:00 2001 From: Lincoln Stein Date: Tue, 30 Jun 2026 22:22:47 -0400 Subject: [PATCH 2/5] fix(qwen): estimate Qwen Image VAE working memory so the cache frees room before decode/encode (#9305) * fix(qwen): estimate VAE working memory so the cache frees room before decode/encode The Qwen Image l2i/i2l invocations called `model_on_device()` without a `working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache therefore only reserved the default `device_working_mem_gb` and never evicted the resident transformer/text encoder before the VAE decode. On a near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder resident) the decode then OOMs trying to allocate its working set into the fragmented remainder. Add `estimate_vae_working_memory_qwen_image()` and pass it into both the decode and encode paths so the cache makes room (evicting other models when needed) before the operation runs. The constant is calibrated against a measured decode on an AMD W7900: at 1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved* (not just allocated) memory so that whenever the cache declines to free room (free >= estimate) the decode is still guaranteed to fit. Encode uses ~half, matching the other estimators (not independently measured). 
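The calibration arithmetic described above can be sketched as follows. This is a minimal illustration, not the shipped code: the helper names `implied_constant` and `headroom` are hypothetical, and the reserved-bytes input is synthetic, constructed from the quoted constant so the arithmetic is easy to check.

```python
def implied_constant(reserved_delta_bytes: int, height: int, width: int, element_size: int) -> float:
    """Constant implied by one measurement under the model mem = h * w * element_size * k."""
    return reserved_delta_bytes / (height * width * element_size)


def headroom(shipped_constant: float, measured_constant: float) -> float:
    """Fractional margin of the shipped constant over the measured one."""
    return shipped_constant / measured_constant - 1.0


# Synthetic measurement (assumption): reserved growth that implies exactly k = 5082
# for a 1248x832 decode in fp16 (element_size == 2).
synthetic_reserved = 5082 * 1248 * 832 * 2

k = implied_constant(synthetic_reserved, 1248, 832, 2)
print(f"implied constant: {k:.0f}")
print(f"headroom at 5500: {headroom(5500, k):.1%}")
```

Rounding the implied constant up keeps the estimate an upper bound, which matters because the cache only frees room when free memory is below the estimate.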
Co-Authored-By: Claude Opus 4.8 * test(qwen): cover VAE working-memory estimate is passed to cache Address review feedback from @Pfannkuchensack on #9305: - Add test_qwen_image_working_memory.py mirroring the z-image pattern, asserting both decode and encode paths call model_on_device with the estimated working_mem_bytes (regression guard for the OOM fix). - Clarify the qwen estimator comment: the encode constant is not independently measured (half of decode, matching siblings' ratio) and should be recalibrated against a measured encode. Co-Authored-By: Claude Opus 4.8 (1M context) * fix(qwen): recalibrate VAE working-memory constants from a measured grid Add scripts/calibrate_qwen_vae_working_memory.py, a backend-portable (CUDA/ROCm) harness that measures peak reserved-memory growth for VAE decode/encode across a resolution grid, one fresh subprocess per point. Calibrating on an AMD W7900 (fp16) showed the encode constant was wrong: the previous 2750 ("half of decode") under-estimated by ~2x at every measured resolution, the exact OOM mode Qwen Image Edit (which encodes a real image) would hit. Raise encode 2750 -> 6300. Decode 5500 is confirmed safe across the full 512^2..2048^2 range and left unchanged. The grid also showed memory is super-linear in area above ~1792^2 (an attention term) and non-monotonic (likely an SDPA-backend crossover on ROCm); both documented in the estimator. Constants are the conservative ROCm side and will be max-merged with a pending NVIDIA/CUDA run. Co-Authored-By: Claude Opus 4.8 (1M context) * fix(qwen): branch VAE working-memory constants by backend (ROCm vs CUDA) Calibrating the same fp16 grid on an NVIDIA card showed CUDA reserves ~2x (decode) to ~4x (encode) less than ROCm: the Qwen VAE is attention- heavy, and CUDA's Flash/efficient attention is O(area) and flat while the ROCm math-attention fallback is O(area^2). 
The backends diverge far more than any headroom, so a single constant either under-estimates on ROCm (OOM) or massively over-budgets CUDA (needless eviction). Select constants via torch.version.hip: decode: ROCm 5500 / CUDA 2900 encode: ROCm 6300 / CUDA 1600 Each verified to cover its measured grid (19 points/backend) with ~8% headroom. The CUDA run also confirms the linear model holds with Flash attention (the ROCm super-linear/non-monotonic behavior is a math- attention artifact), and that "encode is half of decode" is CUDA-only. Add parametrized tests asserting the constant selected for each (operation, backend) so a refactor can't silently swap them. Co-Authored-By: Claude Opus 4.8 (1M context) * chore(backend): ruff * calibrate: support single-file Qwen Image VAE checkpoints The calibration script only loaded the Qwen VAE from a diffusers directory via from_pretrained, so passing a single .safetensors file failed. Add _load_vae, which loads a directory as before and handles a single-file checkpoint by loading the state dict directly: a strict load for the diffusers layout, falling back to convert_wan_vae_to_diffusers for the original Qwen-Image/Wan release layout (downsamples/residual/ time_conv keys) before retrying. 
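The backend-branched estimate described above can be sketched as below. This is a torch-free illustration under stated assumptions: the function name `estimate_working_memory` and the explicit `is_rocm` and `element_size` parameters are hypothetical stand-ins (the patch derives them from `torch.version.hip` and the VAE's parameters), while the four constants and the factor of 8 are the ones quoted in this patch.

```python
from typing import Literal

# Scaling constants quoted in this patch, keyed by (operation, is_rocm).
SCALING_CONSTANTS = {
    ("decode", True): 5500,
    ("decode", False): 2900,
    ("encode", True): 6300,
    ("encode", False): 1600,
}

LATENT_SCALE_FACTOR = 8  # pixels per latent cell along each spatial axis


def estimate_working_memory(
    operation: Literal["encode", "decode"],
    tensor_h: int,
    tensor_w: int,
    element_size: int,
    is_rocm: bool,
) -> int:
    """Linear-in-area estimate in bytes: h * w * element_size * constant.

    Decode receives latents, so its spatial dims are scaled up to pixel space;
    encode receives a pixel-space image and is used as-is.
    """
    scale = LATENT_SCALE_FACTOR if operation == "decode" else 1
    h = scale * tensor_h
    w = scale * tensor_w
    return h * w * element_size * SCALING_CONSTANTS[(operation, is_rocm)]


# A 64x64 latent decodes to 512x512 pixels; fp16 has element_size == 2.
rocm_decode = estimate_working_memory("decode", 64, 64, 2, is_rocm=True)
cuda_decode = estimate_working_memory("decode", 64, 64, 2, is_rocm=False)
print(rocm_decode, cuda_decode)
```

Keeping the constants in one table keyed by (operation, backend) makes it harder to swap two of them by accident, which is the regression the parametrized tests guard against.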
--------- Co-authored-by: Claude Opus 4.8 Co-authored-by: Alexander Eichhorn --- .../qwen_image_image_to_latents.py | 9 +- .../qwen_image_latents_to_image.py | 8 +- invokeai/backend/util/vae_working_memory.py | 52 +++ scripts/calibrate_qwen_vae_working_memory.py | 305 ++++++++++++++++++ .../test_qwen_image_working_memory.py | 136 ++++++++ 5 files changed, 508 insertions(+), 2 deletions(-) create mode 100644 scripts/calibrate_qwen_vae_working_memory.py create mode 100644 tests/app/invocations/test_qwen_image_working_memory.py diff --git a/invokeai/app/invocations/qwen_image_image_to_latents.py b/invokeai/app/invocations/qwen_image_image_to_latents.py index ef88e03082b..ffae5470f68 100644 --- a/invokeai/app/invocations/qwen_image_image_to_latents.py +++ b/invokeai/app/invocations/qwen_image_image_to_latents.py @@ -18,6 +18,7 @@ from invokeai.backend.model_manager.load.load_base import LoadedModel from invokeai.backend.stable_diffusion.diffusers_pipeline import image_resized_to_grid_as_tensor from invokeai.backend.util.devices import TorchDevice +from invokeai.backend.util.vae_working_memory import estimate_vae_working_memory_qwen_image @invocation( @@ -44,7 +45,13 @@ class QwenImageImageToLatentsInvocation(BaseInvocation, WithMetadata, WithBoard) @staticmethod def vae_encode(vae_info: LoadedModel, image_tensor: torch.Tensor) -> torch.Tensor: - with vae_info.model_on_device() as (_, vae): + assert isinstance(vae_info.model, AutoencoderKLQwenImage) + estimated_working_memory = estimate_vae_working_memory_qwen_image( + operation="encode", + image_tensor=image_tensor, + vae=vae_info.model, + ) + with vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae): assert isinstance(vae, AutoencoderKLQwenImage) vae.disable_tiling() diff --git a/invokeai/app/invocations/qwen_image_latents_to_image.py b/invokeai/app/invocations/qwen_image_latents_to_image.py index b3ea39c4bbf..072185f147b 100644 --- a/invokeai/app/invocations/qwen_image_latents_to_image.py 
+++ b/invokeai/app/invocations/qwen_image_latents_to_image.py @@ -19,6 +19,7 @@ from invokeai.app.services.shared.invocation_context import InvocationContext from invokeai.backend.stable_diffusion.extensions.seamless import SeamlessExt from invokeai.backend.util.devices import TorchDevice +from invokeai.backend.util.vae_working_memory import estimate_vae_working_memory_qwen_image @invocation( @@ -41,9 +42,14 @@ def invoke(self, context: InvocationContext) -> ImageOutput: vae_info = context.models.load(self.vae.vae) assert isinstance(vae_info.model, AutoencoderKLQwenImage) + estimated_working_memory = estimate_vae_working_memory_qwen_image( + operation="decode", + image_tensor=latents, + vae=vae_info.model, + ) with ( SeamlessExt.static_patch_model(vae_info.model, self.vae.seamless_axes), - vae_info.model_on_device() as (_, vae), + vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae), ): context.util.signal_progress("Running VAE") assert isinstance(vae, AutoencoderKLQwenImage) diff --git a/invokeai/backend/util/vae_working_memory.py b/invokeai/backend/util/vae_working_memory.py index f9228ced652..57d558c03b4 100644 --- a/invokeai/backend/util/vae_working_memory.py +++ b/invokeai/backend/util/vae_working_memory.py @@ -2,6 +2,7 @@ import torch from diffusers.models.autoencoders.autoencoder_kl import AutoencoderKL +from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage from diffusers.models.autoencoders.autoencoder_tiny import AutoencoderTiny from invokeai.app.invocations.constants import LATENT_SCALE_FACTOR @@ -92,6 +93,57 @@ def estimate_vae_working_memory_flux( return int(working_memory) +def estimate_vae_working_memory_qwen_image( + operation: Literal["encode", "decode"], image_tensor: torch.Tensor, vae: AutoencoderKLQwenImage +) -> int: + """Estimate the working memory required by the invocation in bytes. 
+ + The Qwen Image VAE is a video-style autoencoder that operates on 5D tensors of shape + (B, C, num_frames, H, W). Tiling is not used, so peak working memory scales with the full + spatial output. The two trailing dimensions are the spatial H/W in latent space (decode) or + pixel space (encode), matching the convention used by the other estimators here. + """ + latent_scale_factor_for_operation = LATENT_SCALE_FACTOR if operation == "decode" else 1 + + h = latent_scale_factor_for_operation * image_tensor.shape[-2] + w = latent_scale_factor_for_operation * image_tensor.shape[-1] + element_size = next(vae.parameters()).element_size() + + # The Qwen Image VAE is much heavier than the SD/SDXL VAE and needs correspondingly larger + # constants. These were calibrated by measuring peak *reserved* memory growth (not just allocated + # -- reserved is what the cache's `free >= estimate` check compares against) across a resolution + # grid in fp16, on both an AMD W7900 (ROCm) and an NVIDIA card (CUDA). See + # scripts/calibrate_qwen_vae_working_memory.py. + # + # Implied constant = reserved_bytes / (h * w * element_size). Per-point maxima (fp16): + # 512^2 768^2 1024^2 1536^2 1792^2 2048^2 -> ship (max observed + ~8% headroom) + # ROCm decode 5132 4596 4570 3273 3735 4813 -> 5500 + # ROCm encode 5864 5858 5858 3532 4364 (OOM) -> 6300 + # CUDA decode 2660 2519 2690 2671 2281 (OOM) -> 2900 + # CUDA encode 1456 1451 1458 1456 1455 1455 -> 1600 + # + # Why this branches on backend (the only estimator here that does): + # - The Qwen VAE is attention-heavy. With Flash/efficient attention (CUDA) the attention memory + # is O(area) and the curve is flat/linear; the ROCm build falls back to math attention, which + # is O(area^2), so ROCm reserves ~2x (decode) to ~4x (encode) more and goes super-linear above + # ~1792^2. 
The two backends differ far more than any headroom, so a single constant would + # either under-estimate on ROCm (OOM) or massively over-budget on CUDA (needless eviction). + # - "Encoding is half of decoding" (as the sibling estimators assume) is only true on CUDA. On + # ROCm encode reserves >= decode, so the ROCm encode constant is sized accordingly -- this is + # the path Qwen Image Edit exercises. + # - On ROCm the linear model under-estimates for decodes well above 2048^2, but those OOM on a + # 48GB card regardless; on CUDA the curve stays linear so no extra term is needed. + is_rocm = torch.version.hip is not None + if operation == "decode": + scaling_constant = 5500 if is_rocm else 2900 + else: # encode + scaling_constant = 6300 if is_rocm else 1600 + + working_memory = h * w * element_size * scaling_constant + + return int(working_memory) + + def estimate_vae_working_memory_sd3( operation: Literal["encode", "decode"], image_tensor: torch.Tensor, vae: AutoencoderKL ) -> int: diff --git a/scripts/calibrate_qwen_vae_working_memory.py b/scripts/calibrate_qwen_vae_working_memory.py new file mode 100644 index 00000000000..2f81752aece --- /dev/null +++ b/scripts/calibrate_qwen_vae_working_memory.py @@ -0,0 +1,305 @@ +"""Calibrate the Qwen Image VAE working-memory estimate against measured peak CUDA/HIP memory. + +Background +---------- +``estimate_vae_working_memory_qwen_image`` models peak working memory as a linear function of +spatial area:: + + working_memory = h * w * element_size * scaling_constant + +This script measures the *actual* peak reserved memory the VAE consumes during decode/encode across +a grid of resolutions so the ``scaling_constant`` can be fit from several points instead of one, and +so we can check whether the pure-linear model holds or whether a super-linear (attention) term +appears at high resolution. 
+ +The estimate is consumed by the model cache via ``free >= estimate`` to decide whether to evict, so +it MUST be an upper bound: we measure peak *reserved* (not just allocated) memory, the conservative +quantity that includes caching-allocator overhead and kernel scratch/workspace. + +Portability +----------- +Backend-agnostic: uses only ``torch.cuda.*``, which works on both NVIDIA/CUDA and AMD/ROCm (HIP) +builds of PyTorch. Run the SAME script on each backend and compare the ``implied_constant`` columns +-- the curve *shape* (linear vs. super-linear) is architectural and should match, but the absolute +constant can differ (cuDNN vs. MIOpen conv workspaces, flash-attention availability, allocator +rounding). Ship ``max`` across backends plus headroom. + +Each (operation, resolution) point is measured in a FRESH SUBPROCESS so the caching allocator's +fragmentation history from earlier points cannot contaminate the reserved-delta reading. A point +that OOMs is recorded as ``oom`` rather than aborting the run, so the grid can probe up to the +card's ceiling safely. + +Usage +----- + python scripts/calibrate_qwen_vae_working_memory.py [--vae /path/to/vae_dir] [--csv out.csv] + +If ``--vae`` is omitted, the script auto-discovers an ``AutoencoderKLQwenImage`` under +``$INVOKEAI_ROOT/models``. +""" + +import argparse +import json +import os +import subprocess +import sys +from pathlib import Path + +import torch +from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage + +LATENT_SCALE_FACTOR = 8 + +# (height, width) pixel-space resolutions. Squares to test linearity in area, plus non-square +# points (incl. the original 1248x832 calibration point) to confirm area = h*w is the right +# predictor rather than max(h, w) or perimeter. Subprocess isolation + OOM capture means we can +# list aggressive resolutions; ones that don't fit are simply recorded as oom. 
+DEFAULT_RESOLUTIONS = [ + (512, 512), + (768, 768), + (832, 1248), # original single calibration point (as HxW) + (1024, 1024), + (1088, 1920), + (1280, 1280), + (1536, 1024), + (1536, 1536), + (1792, 1792), + (2048, 2048), +] + + +def discover_vae() -> Path: + """Find an AutoencoderKLQwenImage VAE directory under $INVOKEAI_ROOT/models.""" + root = os.environ.get("INVOKEAI_ROOT") + if not root: + raise SystemExit("INVOKEAI_ROOT not set; pass --vae explicitly.") + models = Path(root) / "models" + for config_path in models.glob("*/vae/config.json"): + try: + cfg = json.loads(config_path.read_text()) + except Exception: + continue + if cfg.get("_class_name") == "AutoencoderKLQwenImage": + return config_path.parent + raise SystemExit(f"No AutoencoderKLQwenImage VAE found under {models}; pass --vae explicitly.") + + +DTYPES = {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32} + + +def _load_vae(vae_path: str, dtype: torch.dtype) -> AutoencoderKLQwenImage: + """Load an AutoencoderKLQwenImage from either a diffusers directory or a single .safetensors file. + + Directory: standard ``from_pretrained``. + Single file: ``AutoencoderKLQwenImage`` has no single-file converter registered in diffusers, + so we instantiate with the default config and load the state dict directly. Two on-disk layouts + exist: the diffusers layout (``encoder.conv_in`` / ``down_blocks`` / ``mid_block`` keys, e.g. the + weights InvokeAI's VAELoader consumes) and the original Qwen-Image/Wan release layout + (``encoder.conv1`` / ``downsamples`` / ``middle`` / ``time_conv`` keys). We try a direct strict + load first, and on a key mismatch fall back to diffusers' Wan VAE converter -- the Qwen-Image VAE + shares the Wan VAE key structure -- before retrying. 
+ """ + path = Path(vae_path) + if not path.is_file(): + return AutoencoderKLQwenImage.from_pretrained(vae_path, local_files_only=True, torch_dtype=dtype) + + from safetensors.torch import load_file + + sd = load_file(str(path)) + for k in list(sd.keys()): + if sd[k].is_floating_point(): + sd[k] = sd[k].to(dtype) + + vae = AutoencoderKLQwenImage() + try: + # diffusers-layout checkpoint: keys already match the model. State dict was converted to + # `dtype` above and is assigned in place, so params carry the correct dtype. + vae.load_state_dict(sd, strict=True, assign=True) + except RuntimeError: + # Original Qwen-Image/Wan release layout: convert keys to the diffusers layout, then retry. + from diffusers.loaders.single_file_utils import convert_wan_vae_to_diffusers + + converted = convert_wan_vae_to_diffusers(sd) + for k in list(converted.keys()): + if converted[k].is_floating_point(): + converted[k] = converted[k].to(dtype) + vae.load_state_dict(converted, strict=True, assign=True) + return vae + + +def _build_input(operation: str, h: int, w: int, z_dim: int, dtype: torch.dtype) -> torch.Tensor: + """Construct the 5D (B, C, num_frames, H, W) input the invocation feeds the VAE. + + decode: latents at latent resolution (H/8, W/8) with z_dim channels. + encode: image at pixel resolution (H, W) with 3 channels. + These mirror QwenImageLatentsToImageInvocation / QwenImageImageToLatentsInvocation exactly. + """ + device = torch.device("cuda") + if operation == "decode": + return torch.randn(1, z_dim, 1, h // LATENT_SCALE_FACTOR, w // LATENT_SCALE_FACTOR, device=device, dtype=dtype) + return torch.randn(1, 3, 1, h, w, device=device, dtype=dtype) + + +@torch.inference_mode() +def measure_one(vae_path: str, operation: str, h: int, w: int, dtype: torch.dtype) -> dict: + """Measure peak reserved-memory growth for a single decode/encode. 
Runs in a child process.""" + vae = _load_vae(vae_path, dtype) + vae.to("cuda") + vae.disable_tiling() # Qwen invocations never tile; match that. + + param = next(vae.parameters()) + dtype = param.dtype + element_size = param.element_size() + z_dim = int(vae.config.z_dim) + + x = _build_input(operation, h, w, z_dim, dtype) + + torch.cuda.synchronize() + torch.cuda.empty_cache() + torch.cuda.reset_peak_memory_stats() + baseline_reserved = torch.cuda.memory_reserved() + + # Measure the COLD first call -- it includes conv-algorithm-search / attention workspace + # allocation, which is exactly what the real (single-shot) invocation pays. + try: + if operation == "decode": + vae.decode(x, return_dict=False) + else: + vae.encode(x).latent_dist.mode() + torch.cuda.synchronize() + except (torch.cuda.OutOfMemoryError, RuntimeError) as e: + if "out of memory" not in str(e).lower(): + raise + return {"operation": operation, "h": h, "w": w, "oom": True} + + peak_reserved = torch.cuda.max_memory_reserved() + peak_allocated = torch.cuda.max_memory_allocated() + reserved_delta = peak_reserved - baseline_reserved + + area = h * w + return { + "operation": operation, + "h": h, + "w": w, + "area": area, + "element_size": element_size, + "dtype": str(dtype), + "reserved_delta": reserved_delta, + "allocated_peak": peak_allocated, + "reserved_baseline": baseline_reserved, + # The constant as the estimator parameterizes it: mem = area * element_size * k + "implied_constant": reserved_delta / (area * element_size), + "oom": False, + } + + +def run_grid(vae_path: str, resolutions: list[tuple[int, int]], dtype_name: str, csv_path: Path | None) -> None: + rows: list[dict] = [] + print(f"VAE: {vae_path}") + print( + f"torch {torch.__version__} | device {torch.cuda.get_device_name(0)} | hip={torch.version.hip} | dtype={dtype_name}\n" + ) + print(f"{'op':6} {'HxW':>11} {'area':>10} {'reserved(GiB)':>14} {'alloc(GiB)':>11} {'implied_k':>10}") + print("-" * 70) + + for operation in ("decode", 
"encode"): + for h, w in resolutions: + # Fresh subprocess per point for an uncontaminated reserved-memory reading. + proc = subprocess.run( + [ + sys.executable, + __file__, + "--single", + operation, + str(h), + str(w), + "--vae", + vae_path, + "--dtype", + dtype_name, + ], + capture_output=True, + text=True, + ) + line = proc.stdout.strip().splitlines()[-1] if proc.stdout.strip() else "" + try: + row = json.loads(line) + except Exception: + print(f"{operation:6} {f'{h}x{w}':>11} FAILED: {proc.stderr.strip().splitlines()[-1:]}") + continue + rows.append(row) + if row.get("oom"): + print(f"{operation:6} {f'{h}x{w}':>11} {h * w:>10} {'OOM':>14}") + continue + gib = 1024**3 + print( + f"{operation:6} {f'{h}x{w}':>11} {row['area']:>10} " + f"{row['reserved_delta'] / gib:>14.3f} {row['allocated_peak'] / gib:>11.3f} " + f"{row['implied_constant']:>10.1f}" + ) + + # Summary: the shippable constant is the MAX implied constant over fitting points (upper bound). + print("\n=== summary (max implied constant = candidate scaling_constant, before headroom) ===") + for operation in ("decode", "encode"): + ks = [r["implied_constant"] for r in rows if r["operation"] == operation and not r.get("oom")] + if ks: + print( + f"{operation:6}: n={len(ks)} min_k={min(ks):.1f} max_k={max(ks):.1f} " + f"-> use >= {max(ks):.0f} (+headroom)" + ) + + if csv_path: + import csv + + fieldnames = [ + "operation", + "h", + "w", + "area", + "element_size", + "dtype", + "reserved_delta", + "allocated_peak", + "reserved_baseline", + "implied_constant", + "oom", + ] + with csv_path.open("w", newline="") as f: + writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore") + writer.writeheader() + for r in rows: + writer.writerow(r) + print(f"\nWrote {csv_path}") + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument( + "--vae", + type=str, + default=None, + help="Path to an 
AutoencoderKLQwenImage diffusers dir OR a single .safetensors checkpoint.", + ) + parser.add_argument("--csv", type=str, default=None, help="Optional path to write the raw results as CSV.") + parser.add_argument( + "--dtype", + choices=list(DTYPES), + default="float16", + help="Compute dtype. Default float16 to match InvokeAI's default precision on CUDA/ROCm.", + ) + # Internal: measure a single point in this process and print one JSON line. + parser.add_argument("--single", nargs=3, metavar=("OP", "H", "W"), default=None, help=argparse.SUPPRESS) + args = parser.parse_args() + + vae_path = args.vae or str(discover_vae()) + dtype = DTYPES[args.dtype] + + if args.single: + op, h, w = args.single[0], int(args.single[1]), int(args.single[2]) + print(json.dumps(measure_one(vae_path, op, h, w, dtype))) + return + + run_grid(vae_path, DEFAULT_RESOLUTIONS, args.dtype, Path(args.csv) if args.csv else None) + + +if __name__ == "__main__": + main() diff --git a/tests/app/invocations/test_qwen_image_working_memory.py b/tests/app/invocations/test_qwen_image_working_memory.py new file mode 100644 index 00000000000..f3dfbe970df --- /dev/null +++ b/tests/app/invocations/test_qwen_image_working_memory.py @@ -0,0 +1,136 @@ +"""Test that Qwen Image VAE invocations properly estimate and request working memory.""" + +from contextlib import nullcontext +from unittest.mock import MagicMock, patch + +import pytest +import torch +from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage + +from invokeai.app.invocations.qwen_image_image_to_latents import QwenImageImageToLatentsInvocation +from invokeai.app.invocations.qwen_image_latents_to_image import QwenImageLatentsToImageInvocation +from invokeai.backend.util.vae_working_memory import estimate_vae_working_memory_qwen_image + + +class TestQwenImageWorkingMemoryEstimate: + """Lock in the per-backend scaling constants calibrated in scripts/calibrate_qwen_vae_working_memory.py. 
+ + These differ by backend because the Qwen VAE is attention-heavy: ROCm falls back to math attention + (O(area^2), much higher memory) while CUDA uses Flash/efficient attention. A regression that swaps + the constants would reintroduce the ROCm OOM (under-estimate) or needlessly over-budget CUDA. + """ + + # Each case is (operation, is_rocm, expected_constant). The estimator scales pixel area (latent dims * 8 per side + # for decode, raw dims for encode) by element_size and the backend-specific constant. + @pytest.mark.parametrize( + "operation, is_rocm, expected_constant", + [ + ("decode", True, 5500), + ("decode", False, 2900), + ("encode", True, 6300), + ("encode", False, 1600), + ], + ) + def test_constant_selected_per_backend(self, operation, is_rocm, expected_constant): + mock_vae = MagicMock(spec=AutoencoderKLQwenImage) + mock_vae.parameters.return_value = iter([torch.zeros(1, dtype=torch.float16)]) # element_size == 2 + + # decode receives latents (pixel area = latent area * 8^2); encode receives a pixel image. + if operation == "decode": + image_tensor = torch.zeros(1, 16, 1, 64, 64) + h = w = 64 * 8 + else: + image_tensor = torch.zeros(1, 3, 1, 512, 512) + h = w = 512 + + hip_value = "7.1.0" if is_rocm else None + with patch("torch.version.hip", hip_value): + result = estimate_vae_working_memory_qwen_image( + operation=operation, image_tensor=image_tensor, vae=mock_vae + ) + + assert result == h * w * 2 * expected_constant + + +class TestQwenImageWorkingMemory: + """Test that Qwen Image VAE invocations request working memory before decode/encode.""" + + def _mock_vae_info(self): + """Build a mocked AutoencoderKLQwenImage and its LoadedModel wrapper.""" + mock_vae = MagicMock(spec=AutoencoderKLQwenImage) + + # Create mock parameter for dtype detection + mock_param = torch.zeros(1) + mock_vae.parameters.return_value = iter([mock_param]) + + # Create mock vae_info with a model_on_device context manager yielding (None, vae) + mock_vae_info = MagicMock() + mock_vae_info.model = mock_vae + + mock_cm = MagicMock() 
+ mock_cm.__enter__ = MagicMock(return_value=(None, mock_vae)) + mock_cm.__exit__ = MagicMock(return_value=None) + mock_vae_info.model_on_device = MagicMock(return_value=mock_cm) + + return mock_vae, mock_vae_info + + def test_qwen_latents_to_image_requests_working_memory(self): + """QwenImageLatentsToImageInvocation estimates decode memory and passes it to the cache.""" + mock_vae, mock_vae_info = self._mock_vae_info() + + # Mock the context + mock_context = MagicMock() + mock_context.models.load.return_value = mock_vae_info + + # Mock latents (5D: B, C, num_frames, H, W) + mock_latents = torch.zeros(1, 16, 1, 64, 64) + mock_context.tensors.load.return_value = mock_latents + + estimation_path = "invokeai.app.invocations.qwen_image_latents_to_image.estimate_vae_working_memory_qwen_image" + seamless_path = "invokeai.app.invocations.qwen_image_latents_to_image.SeamlessExt.static_patch_model" + + with ( + patch(estimation_path) as mock_estimate, + patch(seamless_path, return_value=nullcontext()), + ): + expected_memory = 1024 * 1024 * 10000 # 10GB + mock_estimate.return_value = expected_memory + + invocation = QwenImageLatentsToImageInvocation.model_construct( + latents=MagicMock(latents_name="test_latents"), + vae=MagicMock(vae=MagicMock(), seamless_axes=["x", "y"]), + ) + + try: + invocation.invoke(mock_context) + except Exception: + # Downstream decode math fails under mocking; we only care that the cache was + # asked to reserve the estimated working memory before entering the device context. 
+ pass + + mock_estimate.assert_called_once() + assert mock_estimate.call_args.kwargs["operation"] == "decode" + mock_vae_info.model_on_device.assert_called_once_with(working_mem_bytes=expected_memory) + + def test_qwen_image_to_latents_requests_working_memory(self): + """QwenImageImageToLatentsInvocation estimates encode memory and passes it to the cache.""" + mock_vae, mock_vae_info = self._mock_vae_info() + + mock_image_tensor = torch.zeros(1, 3, 512, 512) + + estimation_path = "invokeai.app.invocations.qwen_image_image_to_latents.estimate_vae_working_memory_qwen_image" + + with patch(estimation_path) as mock_estimate: + expected_memory = 1024 * 1024 * 5000 # 5GB + mock_estimate.return_value = expected_memory + + try: + QwenImageImageToLatentsInvocation.vae_encode(mock_vae_info, mock_image_tensor) + except Exception: + # Downstream encode math fails under mocking; we only care that the cache was + # asked to reserve the estimated working memory before entering the device context. + pass + + mock_estimate.assert_called_once() + assert mock_estimate.call_args.kwargs["operation"] == "encode" + mock_vae_info.model_on_device.assert_called_once_with(working_mem_bytes=expected_memory) From f8e9018778d49a4f60257ba494e5db6d4e23ebd3 Mon Sep 17 00:00:00 2001 From: Lincoln Stein Date: Sat, 4 Jul 2026 18:10:31 -0400 Subject: [PATCH 3/5] docs: add 3d party GPU hosting services (#9299) * docs: add 3d party GPU hosting services * Update docs/src/content/docs/index.mdx Co-authored-by: Josh Corbett * docs: restyle hosted options as LinkCards in a wrapper card Implements joshistoast's suggested design: a bordered "Hosted Options" wrapper containing Starlight CardGrid/LinkCard entries, replacing the text separator and hand-rolled cards. 
Co-Authored-By: Claude Fable 5 --------- Co-authored-by: Alexander Eichhorn Co-authored-by: Josh Corbett Co-authored-by: Claude Fable 5 --- docs/src/content/docs/index.mdx | 4 +- docs/src/lib/components/DownloadOptions.astro | 51 ++++++++++++++++++- docs/src/styles/custom.css | 5 ++ 3 files changed, 57 insertions(+), 3 deletions(-) diff --git a/docs/src/content/docs/index.mdx b/docs/src/content/docs/index.mdx index 52de38f6faf..f9f8cc25c1b 100644 --- a/docs/src/content/docs/index.mdx +++ b/docs/src/content/docs/index.mdx @@ -118,8 +118,8 @@ Ready to unleash your creativity? Invoke is available for Windows, macOS, and Li --- -:::note[About the Hosted Version] -The Invoke hosted platform has been shut down as the founding team joined Adobe. However, Invoke lives on as a thriving open-source project maintained by the community. +:::note[About the former officially-hosted version] +The Invoke.ai hosted platform has been shut down as the founding team joined Adobe. However, Invoke lives on as a thriving open-source project maintained by the community. The open-source version offers the same powerful features you may have used in the hosted service, with the added benefit of complete control and privacy through self-hosting. 
diff --git a/docs/src/lib/components/DownloadOptions.astro b/docs/src/lib/components/DownloadOptions.astro index dfae32a4915..95124322fd9 100644 --- a/docs/src/lib/components/DownloadOptions.astro +++ b/docs/src/lib/components/DownloadOptions.astro @@ -1,5 +1,5 @@ --- -import { LinkCard, Icon, LinkButton } from '@astrojs/starlight/components'; +import { CardGrid, LinkCard, Icon, LinkButton } from '@astrojs/starlight/components'; import { type StarlightIcon } from '@astrojs/starlight/types'; import { withBase } from '../base-path'; @@ -49,6 +49,24 @@ const manualDownloadOptions = { href: withBase('/configuration/docker/', import.meta.env.BASE_URL), }, }; + +const hostedOptions = { + aibadgr: { + headline: 'Run on AI Badgr', + description: 'Run on the AI Badgr hosted GPU service', + href: 'https://aibadgr.com/gpu/launch?template=invokeai', + }, + runpod: { + headline: 'Run on RunPod', + description: 'Run on the RunPod hosted GPU service', + href: 'https://www.runpod.io/blog/invoke-ai-stable-diffusion-runpod-nfz18', + }, + railway: { + headline: 'Run on Railway', + description: 'Run on the Railway hosted GPU service', + href: 'https://railway.com/deploy/invokeai', + }, +}; ---
@@ -92,6 +110,20 @@ const manualDownloadOptions = { )) }
+ + +
+

Hosted Options

+

For users who want to run Invoke on a hosted GPU service instead of their own hardware.

+ + + { + Object.entries(hostedOptions).map(([key, { headline, href, description }]) => ( + + )) + } + +