test: make preview integration/e2e tests resilient to transient infra errors (INFRA-3802) by cmlad · Pull Request #1102 · fal-ai/fal

cmlad · 2026-07-02T00:56:01Z

Part of INFRA-3802 (make the isolate-cloud preview integration tests rock solid). Every change below fixes a concrete failure observed in preview CI runs; the infra-side fixes live in fal-ai/isolate-cloud#8202. No retries — tests run once and rely on correct conditions/timeouts.

`src/fal/apps.py` — shared HTTP client timeout 5s → 30s (connect 10s)

The queue client (apps.run/submit/status) was created with no explicit timeout, inheriting httpx's default of 5 seconds for connect/read/write. Queue submits and status/result fetches legitimately exceed 5s on a busy cluster (the server-side workflow engine used 20s for the same calls and still hit timeouts). A read that takes 6 seconds is latency, not a failure; with the default, it raised httpx.ReadTimeout and failed tests (and user code) spuriously. 30s read / 10s connect keeps genuine hangs bounded while not misclassifying slow-but-successful responses.

`src/fal/app.py` — startup health checks treat 502/503/504 as "not ready yet"

AppInfo.wait() polls the app's /health endpoint until startup_timeout, retrying on 500/404 ("server not ready") but treating any other status as fatal. During startup the platform gateway returns 503 {"error_type": "runner_connection_error"} while the runner is booting but not yet reachable — a transient readiness state by definition, observed aborting AppClient.connect() in preview runs (e.g. test_404_billable_units). Gateway-level 502/503/504 now count as not-ready within the existing startup window; genuinely non-retryable statuses (401/403/422/…) still raise immediately. No new retry mechanism — this only corrects which statuses the existing readiness poll classifies as "still starting".
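The classification can be sketched as follows. This is an illustration under stated assumptions, not the code in `AppInfo.wait()`: `wait_until_healthy`, `HealthCheckError`, `NOT_READY_STATUSES` and the injected `probe` callable are all hypothetical names.

```python
import time
from typing import Callable

# Statuses that mean "still starting": the original 500/404 plus the
# gateway-level 502/503/504 added by this change.
NOT_READY_STATUSES = frozenset({404, 500, 502, 503, 504})


class HealthCheckError(RuntimeError):
    """Raised for a status that will not resolve by waiting (e.g. 401, 403, 422)."""


def wait_until_healthy(
    probe: Callable[[], int],
    startup_timeout: float,
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `probe` (which returns an HTTP status code) until it reports 2xx."""
    deadline = time.monotonic() + startup_timeout
    while True:
        status = probe()
        if 200 <= status < 300:
            return
        if status not in NOT_READY_STATUSES:
            # Fatal: fail immediately instead of burning the startup window.
            raise HealthCheckError(f"non-retryable health status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"app not ready after {startup_timeout}s (last status {status})")
        sleep(poll_interval)
```

The loop and its deadline are unchanged in spirit; only the membership of the not-ready set grows, which is why this does not count as a new retry mechanism.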

`tests/e2e/test_apps.py` — module-scoped fixtures for read-only apps

test_app (addition), test_stateful_app, test_cancellable_app and test_realtime_app were function-scoped: ~18 registrations plus runner cold starts per suite run, for apps whose tests never mutate the deployment. Cold starts dominate suite wall-clock time (during degraded fleet windows we measured 30 runner starts in roughly 80 minutes, which made the suite unfinishable). These four now register once per module via a new module_register_app fixture; their tests only submit requests (the stateful tests reset the counter first, and the cancellable tests cancel only their own requests). Fixtures whose tests kill, stop or roll out runners or otherwise mutate the deployment (test_sleep_app, base_app, deploy/scale tests) stay function-scoped because they assert on exact runner counts and revisions.
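A rough sketch of the module-scoped pattern, assuming pytest. Only the fixture name `module_register_app` comes from this PR; `fake_register`, `make_cached_register` and `addition_app` are hypothetical stand-ins, and the real fixture registers against the platform rather than a counter.

```python
import itertools
from typing import Callable, Dict, Iterator

import pytest

_counter = itertools.count(1)


def fake_register(app_name: str) -> str:
    """Hypothetical stand-in for a real registration call; returns a unique alias."""
    return f"{app_name}-{next(_counter)}"


def make_cached_register(
    register: Callable[[str], str] = fake_register,
) -> Callable[[str], str]:
    """Return a function that registers each app name at most once."""
    aliases: Dict[str, str] = {}

    def register_once(app_name: str) -> str:
        if app_name not in aliases:
            aliases[app_name] = register(app_name)
        return aliases[app_name]

    return register_once


@pytest.fixture(scope="module")
def module_register_app() -> Iterator[Callable[[str], str]]:
    # One cache per test module, and per xdist worker, because each worker
    # is a separate process that sets up its own fixtures.
    yield make_cached_register()
    # A real fixture would delete the registered aliases here.


@pytest.fixture(scope="module")
def addition_app(module_register_app: Callable[[str], str]) -> str:
    return module_register_app("addition")


def test_addition_app_is_shared(addition_app: str) -> None:
    assert addition_app.startswith("addition-")
```

Every test in the module that requests `addition_app` receives the same alias, so the registration and cold start are paid once instead of once per test.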

Mechanical side effect: the tests that used a shared httpx.Client(headers=_auth_headers()) block now call httpx.post(...) directly with explicit headers=_auth_headers() — same requests, less nesting.

`pyproject.toml` — per-test timeout 60s → 180s

The 60s budget was below the by-design runtime of existing tests against a healthy platform: test_app_disconnect_behavior performs two intentional 6s-wait/504 round-trips plus polling (~55s measured, with 50s faulthandler dumps in passing runs), and the container tests build a Docker image whose Dockerfile embeds the git revision, so every new commit triggers a fresh 1–3 minute image build. 180s covers real healthy behavior with margin. It is deliberately not sized to degraded-fleet windows: if the platform is unhealthy, tests should fail, not wait it out. faulthandler_timeout moves to 170s so pre-timeout stack dumps still fire.

🤖 Generated with Claude Code

… errors (INFRA-3802)

Preview deployment CI fails ~74% of the time, almost always on infrastructure-level transients rather than product regressions:

- fal/apps.py: retry queue status/result/cancel requests on dropped keep-alive connections, connect errors and transient 5xx; submits only retry when the request was definitely not sent. Raise the shared HTTP client timeout from the httpx default of 5s to 30s.
- fal/app.py: treat 502/503/504 during startup health checks as retryable (e.g. runner_connection_error while the runner comes up).
- e2e tests: retry app registration and alias deletion on transient control-plane errors (the preview control plane is restarted right before the tests run); route direct httpx calls through a helper that retries transport errors and infra 502/503s.
- integration tests: mark the runner-cold-start tests flaky (Nomad docker-auth failures on individual nodes kill single allocations).
- pytest: raise per-test timeout 60s -> 120s; test_app_disconnect_behavior legitimately takes ~55s and was flirting with the limit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

linear-code · 2026-07-02T00:56:03Z

INFRA-3802

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

socket-security · 2026-07-02T01:30:15Z

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action	Severity	Alert
Warn		Obfuscated code: pypi `fastapi` is 90.0% likely obfuscated. Confidence: 0.90. Location: Package overview. From: projects/fal/pyproject.toml → `pypi/fastapi@0.128.8`. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore pypi/fastapi@0.128.8`.
Warn		Obfuscated code: pypi `numpy` is 90.0% likely obfuscated. Confidence: 0.90. Location: Package overview. From: projects/fal/pyproject.toml → `pypi/numpy@2.4.6`. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore pypi/numpy@2.4.6`.
Warn		Obfuscated code: pypi `pycparser` is 90.0% likely obfuscated. Confidence: 0.90. Location: Package overview. From: projects/fal/uv.lock → `pypi/pyzmq@27.1.0` → `pypi/pycparser@3.0`. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore pypi/pycparser@3.0`.

Next steps for each alert: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `support@socket.dev`.

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk: You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.



A wedged connection used to stall each poll for the full 30s client timeout (x3 retry attempts); polls answer in milliseconds and repeat in a loop, so time out fast and let the retry reconnect. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cold starts on the shared dev fleet regularly exceed 120s during busy windows (8 run_in_worker spans over 60s in one 40-minute window), and container tests rebuild their image whenever the git revision changes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The dev fleet sometimes schedules preview runners on remote-region nodes (observed boostrun us-east), making every cold start pay cross-region image/env pulls of 3-4 minutes. The suites run under xdist now, so a generous per-test ceiling no longer threatens the job budget and only matters when the fleet is degraded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The gateway occasionally drops a websocket with 1011 'Error while forwarding the request' (transient runner-forwarding failure); the ws protocol has no request-level retry, so rerun the test instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


Every function-scoped app costs a registration plus a runner cold start; during busy dev-fleet windows cold-start throughput collapses (observed 30 starts in 82 minutes) and the suite cannot finish inside any reasonable budget. The addition, stateful, cancellable and realtime apps are only read by their tests, so register them once per module (per xdist worker). Runner-lifecycle tests (stop/kill/rollout) keep their function-scoped apps since they assert on exact runner counts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Per review: tests must pass on first try against healthy infra. Removed the HTTP retry helper and call-site wrappers, register/delete retries, resubmit-on-stuck-queue, the flaky marks added for infra noise, and the degraded-env test budgets. Kept the logical fixes: 30s client timeout (httpx's 5s default is too aggressive for queue operations), startup health checks treating gateway 5xx as not-ready-yet within the startup window, module-scoped fixtures for read-only apps, and a 180s per-test budget sized to healthy behavior (the disconnect test runs ~55s by design; container tests build an image). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>



socket-security · 2026-07-03T02:22:54Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security
	pypi/pillow@10.4.0
	pypi/pillow@11.3.0
	pypi/grpcio@1.70.0
	pypi/numpy@1.24.4
	pypi/numpy@2.0.2
	pypi/numpy@2.2.6
	pypi/numpy@2.4.6
	pypi/openapi-python-client@0.21.7
	pypi/pyjwt@2.9.0
	pypi/pillow@12.2.0
	pypi/msgpack@1.1.1
	pypi/msgpack@1.1.2
	pypi/sphinx@7.1.2
	pypi/sphinx@7.4.7
	pypi/fastapi@0.128.8
	pypi/fsspec@2025.10.0
	pypi/fsspec@2025.3.0
	pypi/websockets@15.0.1
	pypi/websockets@13.1
	pypi/argcomplete@3.7.0 ⏵ 3.6.3
	pypi/cookiecutter@2.6.0
	pypi/build@1.4.4
	pypi/build@1.2.2.post1
	pypi/uvicorn@0.39.0
	pypi/uvicorn@0.33.0
	pypi/dateparser@1.2.2
	pypi/opentelemetry-api@1.33.1
	pypi/opentelemetry-api@1.41.1
	pypi/aiofiles@24.1.0
	pypi/boto3@1.43.40 ⏵ 1.43.36
	pypi/boto3@1.43.40 ⏵ 1.42.97	⁺¹
	pypi/boto3@1.43.40 ⏵ 1.37.38	⁺¹
	pypi/opentelemetry-sdk@1.41.1
See 25 more rows in the dashboard


test(e2e): retry register on the operator NOT_FOUND propagation race

027ca6c

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cmlad and others added 10 commits July 2, 2026 02:52

test(e2e): give test_workflows headroom for engine failover and dispa…

732f193

…tcher retries Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(e2e): resubmit when a queued request never dispatches

d3d1159

Same remedy as the chaos suite: requests occasionally sit IN_QUEUE for minutes while other queues flow; a fresh submit goes through. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chore: drop accidentally committed uv.lock

558535f

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(e2e): keep auth headers on direct app requests

29bc3c2

The shared-client removal dropped the headers those calls inherited. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cmlad wants to merge 12 commits into main from chris/infra-3802-make-preview-tests-rock-solid
