test: make preview integration/e2e tests resilient to transient infra errors (INFRA-3802) #1102

Draft
cmlad wants to merge 12 commits into main from chris/infra-3802-make-preview-tests-rock-solid

cmlad (Contributor) commented Jul 2, 2026

Part of INFRA-3802 (make the isolate-cloud preview integration tests rock solid). Every change below fixes a concrete failure observed in preview CI runs; the infra-side fixes live in fal-ai/isolate-cloud#8202. No retries — tests run once and rely on correct conditions/timeouts.

src/fal/apps.py — shared HTTP client timeout 5s → 30s (connect 10s)

The queue client (apps.run/submit/status) was created with no explicit timeout, inheriting httpx's default of 5 seconds for connect/read/write. Queue submits and status/result fetches legitimately exceed 5s on a busy cluster (the server-side workflow engine used 20s for the same calls and still hit timeouts). A read that takes 6 seconds is latency, not a failure; with the default, it raised httpx.ReadTimeout and failed tests (and user code) spuriously. 30s read / 10s connect keeps genuine hangs bounded while not misclassifying slow-but-successful responses.

src/fal/app.py — startup health checks treat 502/503/504 as "not ready yet"

AppInfo.wait() polls the app's /health endpoint until startup_timeout, retrying on 500/404 ("server not ready") but treating any other status as fatal. During startup the platform gateway returns 503 {"error_type": "runner_connection_error"} while the runner is booting but not yet reachable — a transient readiness state by definition, observed aborting AppClient.connect() in preview runs (e.g. test_404_billable_units). Gateway-level 502/503/504 now count as not-ready within the existing startup window; genuinely non-retryable statuses (401/403/422/…) still raise immediately. No new retry mechanism — this only corrects which statuses the existing readiness poll classifies as "still starting".
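The classification above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the code in `AppInfo.wait()`: `wait_until_healthy`, `AppStartupError`, and the injected `probe` callable (which returns an HTTP status code) are all hypothetical names.

```python
import time
from typing import Callable

# Statuses the readiness poll treats as "still starting": the original
# 404/500 plus the gateway-level 502/503/504 described above.
NOT_READY_STATUSES = frozenset({404, 500, 502, 503, 504})


class AppStartupError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def wait_until_healthy(
    probe: Callable[[], int],
    startup_timeout: float,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `probe` until it returns 2xx, a fatal status, or the window closes."""
    deadline = clock() + startup_timeout
    while True:
        status = probe()
        if 200 <= status < 300:
            return  # app is ready
        if status not in NOT_READY_STATUSES:
            # 401/403/422 and similar are not readiness states: fail at once.
            raise AppStartupError(f"health check returned non-retryable status {status}")
        if clock() >= deadline:
            raise AppStartupError(
                f"app not ready within {startup_timeout}s (last status {status})"
            )
        sleep(poll_interval)
```

The loop stays bounded by the existing startup window; only the set of statuses counted as "not ready yet" grows.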

tests/e2e/test_apps.py — module-scoped fixtures for read-only apps

test_app (addition), test_stateful_app, test_cancellable_app and test_realtime_app were function-scoped: ~18 registrations + runner cold starts per suite run for apps whose tests never mutate the deployment. Cold starts dominate suite wall clock (during degraded fleet windows we measured 30 runner starts in roughly 80 minutes, which made the suite unfinishable). These four now register once per module via a new module_register_app fixture; their tests only submit requests (the stateful tests reset the counter first, the cancellable tests cancel only their own requests). Fixtures whose tests kill/stop/roll out runners or mutate the deployment (test_sleep_app, base_app, deploy/scale tests) stay function-scoped because they assert on exact runner counts and revisions.
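The scoping difference can be sketched with a stand-in for the control plane. Everything here is hypothetical (`FakeControlPlane`, `registered_app`, and the two fixture names); it is not the PR's `module_register_app` fixture, only an illustration of how `scope="module"` changes the number of registrations.

```python
import pytest


class FakeControlPlane:
    """Hypothetical stand-in that counts registrations (each one implies a cold start)."""

    def __init__(self) -> None:
        self.registrations = 0
        self.deleted: list[str] = []

    def register(self, name: str) -> str:
        self.registrations += 1
        return f"{name}-{self.registrations}"

    def delete(self, alias: str) -> None:
        self.deleted.append(alias)


CONTROL_PLANE = FakeControlPlane()


def registered_app(name: str):
    """Register an app, hand its alias to the tests, and clean up afterwards."""
    alias = CONTROL_PLANE.register(name)
    try:
        yield alias
    finally:
        CONTROL_PLANE.delete(alias)


@pytest.fixture(scope="module")
def addition_app():
    # Read-only app: registered once and shared by every test in the module.
    yield from registered_app("addition")


@pytest.fixture
def sleep_app():
    # Runner-lifecycle tests mutate the deployment, so each test gets its own app.
    yield from registered_app("sleep")
```

Under pytest-xdist, a module-scoped fixture is instantiated once per worker process that runs tests from that module, which matches the "once per module (per xdist worker)" wording in the commit log.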

Mechanical side effect: the tests that used a shared httpx.Client(headers=_auth_headers()) block now call httpx.post(...) directly with explicit headers=_auth_headers() — same requests, less nesting.

pyproject.toml — per-test timeout 60s → 180s

The 60s budget was below the by-design runtime of existing tests against a healthy platform: test_app_disconnect_behavior performs two intentional 6s-wait/504 round-trips plus polling (~55s measured, so the 50s faulthandler dump fired even in passing runs), and the container tests build a Docker image whose Dockerfile embeds the git revision, so every new commit triggers a fresh 1–3 minute image build. 180s covers real healthy behavior with margin; it is deliberately not sized to degraded-fleet windows: if the platform is unhealthy, tests should fail, not wait it out. faulthandler_timeout moves to 170s so pre-timeout stack dumps still fire.
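As a rough illustration, the two settings might look like this in pyproject.toml. The table name `[tool.pytest.ini_options]` and the use of the pytest-timeout plugin's `timeout` option are assumptions; the PR text only states the values.

```toml
[tool.pytest.ini_options]
# Per-test ceiling in seconds (assumes the pytest-timeout plugin is installed).
timeout = 180
# Built-in pytest option: dump all thread stacks if a test runs longer than this.
# Kept below `timeout` so the dump fires before the test is killed.
faulthandler_timeout = 170
```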

🤖 Generated with Claude Code

… errors (INFRA-3802)

Preview deployment CI fails ~74% of the time, almost always on
infrastructure-level transients rather than product regressions:

- fal/apps.py: retry queue status/result/cancel requests on dropped
  keep-alive connections, connect errors and transient 5xx; submits only
  retry when the request was definitely not sent. Raise the shared HTTP
  client timeout from the httpx default of 5s to 30s.
- fal/app.py: treat 502/503/504 during startup health checks as
  retryable (e.g. runner_connection_error while the runner comes up).
- e2e tests: retry app registration and alias deletion on transient
  control-plane errors (the preview control plane is restarted right
  before the tests run); route direct httpx calls through a helper that
  retries transport errors and infra 502/503s.
- integration tests: mark the runner-cold-start tests flaky (Nomad
  docker-auth failures on individual nodes kill single allocations).
- pytest: raise per-test timeout 60s -> 120s; test_app_disconnect_behavior
  legitimately takes ~55s and was flirting with the limit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

linear-code Bot commented Jul 2, 2026

INFRA-3802

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

socket-security Bot commented Jul 2, 2026

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action Severity Alert
Warn High
Obfuscated code: pypi fastapi is 90.0% likely obfuscated

Confidence: 0.90

Location: Package overview

From: projects/fal/pyproject.toml → pypi/fastapi@0.128.8

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore pypi/fastapi@0.128.8. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

Warn High
Obfuscated code: pypi numpy is 90.0% likely obfuscated

Confidence: 0.90

Location: Package overview

From: projects/fal/pyproject.toml → pypi/numpy@2.4.6

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore pypi/numpy@2.4.6. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

Warn High
Obfuscated code: pypi pycparser is 90.0% likely obfuscated

Confidence: 0.90

Location: Package overview

From: projects/fal/uv.lock → pypi/pyzmq@27.1.0 → pypi/pycparser@3.0

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore pypi/pycparser@3.0. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

cmlad and others added 10 commits July 2, 2026 02:52
…tcher retries

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A wedged connection used to stall each poll for the full 30s client
timeout (x3 retry attempts); polls answer in milliseconds and repeat in
a loop, so time out fast and let the retry reconnect.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cold starts on the shared dev fleet regularly exceed 120s during busy
windows (8 run_in_worker spans over 60s in one 40-minute window), and
container tests rebuild their image whenever the git revision changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The dev fleet sometimes schedules preview runners on remote-region
nodes (observed boostrun us-east), making every cold start pay
cross-region image/env pulls of 3-4 minutes. The suites run under
xdist now, so a generous per-test ceiling no longer threatens the job
budget and only matters when the fleet is degraded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The gateway occasionally drops a websocket with 1011 'Error while
forwarding the request' (transient runner-forwarding failure); the ws
protocol has no request-level retry, so rerun the test instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Same remedy as the chaos suite: requests occasionally sit IN_QUEUE for
minutes while other queues flow; a fresh submit goes through.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Every function-scoped app costs a registration plus a runner cold
start; during busy dev-fleet windows cold-start throughput collapses
(observed 30 starts in 82 minutes) and the suite cannot finish inside
any reasonable budget. The addition, stateful, cancellable and
realtime apps are only read by their tests, so register them once per
module (per xdist worker). Runner-lifecycle tests (stop/kill/rollout)
keep their function-scoped apps since they assert on exact runner
counts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Per review: tests must pass on first try against healthy infra.
Removed the HTTP retry helper and call-site wrappers, register/delete
retries, resubmit-on-stuck-queue, the flaky marks added for infra
noise, and the degraded-env test budgets. Kept the logical fixes:
30s client timeout (httpx's 5s default is too aggressive for queue
operations), startup health checks treating gateway 5xx as
not-ready-yet within the startup window, module-scoped fixtures for
read-only apps, and a 180s per-test budget sized to healthy behavior
(the disconnect test runs ~55s by design; container tests build an
image).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The shared-client removal dropped the headers those calls inherited.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>