test: make preview integration/e2e tests resilient to transient infra errors (INFRA-3802) #1102
cmlad wants to merge 12 commits into
… errors (INFRA-3802)
Preview deployment CI fails ~74% of the time, almost always on infrastructure-level transients rather than product regressions:
- fal/apps.py: retry queue status/result/cancel requests on dropped keep-alive connections, connect errors and transient 5xx; submits only retry when the request was definitely not sent. Raise the shared HTTP client timeout from the httpx default of 5s to 30s.
- fal/app.py: treat 502/503/504 during startup health checks as retryable (e.g. runner_connection_error while the runner comes up).
- e2e tests: retry app registration and alias deletion on transient control-plane errors (the preview control plane is restarted right before the tests run); route direct httpx calls through a helper that retries transport errors and infra 502/503s.
- integration tests: mark the runner-cold-start tests flaky (Nomad docker-auth failures on individual nodes kill single allocations).
- pytest: raise per-test timeout 60s -> 120s; test_app_disconnect_behavior legitimately takes ~55s and was flirting with the limit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tcher retries Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A wedged connection used to stall each poll for the full 30s client timeout (x3 retry attempts); polls answer in milliseconds and repeat in a loop, so time out fast and let the retry reconnect. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cold starts on the shared dev fleet regularly exceed 120s during busy windows (8 run_in_worker spans over 60s in one 40-minute window), and container tests rebuild their image whenever the git revision changes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The dev fleet sometimes schedules preview runners on remote-region nodes (observed boostrun us-east), making every cold start pay cross-region image/env pulls of 3-4 minutes. The suites run under xdist now, so a generous per-test ceiling no longer threatens the job budget and only matters when the fleet is degraded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The gateway occasionally drops a websocket with 1011 'Error while forwarding the request' (transient runner-forwarding failure); the ws protocol has no request-level retry, so rerun the test instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Same remedy as the chaos suite: requests occasionally sit IN_QUEUE for minutes while other queues flow; a fresh submit goes through. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Every function-scoped app costs a registration plus a runner cold start; during busy dev-fleet windows cold-start throughput collapses (observed 30 starts in 82 minutes) and the suite cannot finish inside any reasonable budget. The addition, stateful, cancellable and realtime apps are only read by their tests, so register them once per module (per xdist worker). Runner-lifecycle tests (stop/kill/rollout) keep their function-scoped apps since they assert on exact runner counts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Per review: tests must pass on first try against healthy infra. Removed the HTTP retry helper and call-site wrappers, register/delete retries, resubmit-on-stuck-queue, the flaky marks added for infra noise, and the degraded-env test budgets. Kept the logical fixes: 30s client timeout (httpx's 5s default is too aggressive for queue operations), startup health checks treating gateway 5xx as not-ready-yet within the startup window, module-scoped fixtures for read-only apps, and a 180s per-test budget sized to healthy behavior (the disconnect test runs ~55s by design; container tests build an image). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The shared-client removal dropped the headers those calls inherited. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Part of INFRA-3802 (make the isolate-cloud preview integration tests rock solid). Every change below fixes a concrete failure observed in preview CI runs; the infra-side fixes live in fal-ai/isolate-cloud#8202. No retries — tests run once and rely on correct conditions/timeouts.
src/fal/apps.py: shared HTTP client timeout 5s → 30s (connect 10s)
The queue client (apps.run/submit/status) was created with no explicit timeout, inheriting httpx's default of 5 seconds for connect/read/write. Queue submits and status/result fetches legitimately exceed 5s on a busy cluster (the server-side workflow engine used 20s for the same calls and still hit timeouts). A read that takes 6 seconds is latency, not a failure; with the default, it raised httpx.ReadTimeout and failed tests (and user code) spuriously. 30s read / 10s connect keeps genuine hangs bounded while not misclassifying slow-but-successful responses.
src/fal/app.py: startup health checks treat 502/503/504 as "not ready yet"
AppInfo.wait() polls the app's /health endpoint until startup_timeout, retrying on 500/404 ("server not ready") but treating any other status as fatal. During startup the platform gateway returns 503 {"error_type": "runner_connection_error"} while the runner is booting but not yet reachable. That is a transient readiness state by definition, and it was observed aborting AppClient.connect() in preview runs (e.g. test_404_billable_units). Gateway-level 502/503/504 now count as not-ready within the existing startup window; genuinely non-retryable statuses (401/403/422/…) still raise immediately. No new retry mechanism: this only corrects which statuses the existing readiness poll classifies as "still starting".
tests/e2e/test_apps.py: module-scoped fixtures for read-only apps
test_app (addition), test_stateful_app, test_cancellable_app and test_realtime_app were function-scoped: ~18 registrations + runner cold starts per suite run for apps whose tests never mutate the deployment. Cold starts dominate suite wall clock (during degraded fleet windows we measured 30 runner starts per 80 minutes, which made the suite unfinishable). These four now register once per module via a new module_register_app fixture; their tests only submit requests (the stateful tests reset the counter first, the cancellable tests cancel only their own requests).
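The scoping change above can be illustrated with a minimal pytest sketch. Only the fixture name module_register_app comes from this PR; register_app, addition_app and the returned dict are hypothetical stand-ins, not the real fal registration code. With scope="module", the app fixture body runs once per test module (and once per xdist worker, since each worker has its own fixture instances), so both tests below see the same registration.

```python
import itertools

import pytest

# Hypothetical counter so the sketch can show how often registration runs.
_registration_counter = itertools.count(1)


def register_app(name: str) -> dict:
    """Hypothetical stand-in for the real (slow) registration + cold start."""
    return {"name": name, "registration_no": next(_registration_counter)}


@pytest.fixture(scope="module")
def module_register_app():
    """Yield a register function; set up and torn down once per module."""
    registered = []

    def _register(name: str) -> dict:
        app = register_app(name)
        registered.append(app)
        return app

    yield _register
    # Stand-in for deleting the registered apps after the module's last test.
    registered.clear()


@pytest.fixture(scope="module")
def addition_app(module_register_app):
    """Read-only app: registered once and shared by every test in the module."""
    return module_register_app("addition")


def test_first_request(addition_app):
    assert addition_app["registration_no"] == 1


def test_second_request(addition_app):
    # Same registration as the previous test: no second cold start.
    assert addition_app["registration_no"] == 1
```

With the default function scope, the second test would see registration_no == 2, which is the per-test registration cost the module scope avoids.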
Fixtures whose tests kill/stop/roll out runners or mutate the deployment (test_sleep_app, base_app, deploy/scale tests) stay function-scoped because they assert on exact runner counts and revisions.
Mechanical side effect: the tests that used a shared httpx.Client(headers=_auth_headers()) block now call httpx.post(...) directly with explicit headers=_auth_headers(); same requests, less nesting.
pyproject.toml: per-test timeout 60s → 180s
The 60s budget was below the by-design runtime of existing tests against a healthy platform: test_app_disconnect_behavior performs two intentional 6s-wait/504 round-trips plus polling (~55s measured, with 50s faulthandler dumps in passing runs), and the container tests build a Docker image whose dockerfile embeds the git revision, so every new commit triggers a fresh 1–3 minute image build. 180s covers real healthy behavior with margin; it is deliberately not sized to degraded-fleet windows. If the platform is unhealthy, tests should fail, not wait it out. faulthandler_timeout moves to 170s so pre-timeout stack dumps still fire.
🤖 Generated with Claude Code
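For reference, the readiness classification described for src/fal/app.py can be sketched in plain Python. This is an illustration under stated assumptions, not the actual AppInfo.wait() code: wait_until_healthy, AppStartupError, get_status and poll_interval are hypothetical names, and only the status sets (404/500 plus 502/503/504 as "not ready", everything else non-2xx as fatal) come from the description above.

```python
import time
from typing import Callable

# Statuses the readiness poll treats as "still starting" (per the PR text).
NOT_READY_STATUSES = frozenset({404, 500, 502, 503, 504})


class AppStartupError(RuntimeError):
    """Hypothetical error for a non-retryable health check status."""


def wait_until_healthy(
    get_status: Callable[[], int],
    startup_timeout: float,
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until a 2xx status, a fatal status, or the startup deadline."""
    deadline = time.monotonic() + startup_timeout
    while True:
        status = get_status()  # hypothetical: GET /health, return status code
        if 200 <= status < 300:
            return
        if status not in NOT_READY_STATUSES:
            # e.g. 401/403/422: waiting longer cannot help, so fail at once.
            raise AppStartupError(f"non-retryable health status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"app not ready after {startup_timeout}s (last status {status})"
            )
        sleep(poll_interval)
```

The deadline is the only bound on waiting: a gateway 503 is polled again inside the same startup window rather than retried by a separate mechanism.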