feat(bot-blocker): probe multiple HTTP clients with a per-client verdict by habansal · Pull Request #2696 · adobe/spacecat-api-service

habansal · 2026-06-25T22:35:04Z

What & why

Implements the api-service part of SITES-47217 (datacom.com investigation, SITES-47216).

detect bot-blocker only probed with the @adobe/fetch client — the one Cloudflare Bot Management happens to allow — so it reported "Crawlable: Yes" while the clients that do the real work were 403'd: undici (Node fetch: CWV liveness, preflight, site-detection, imports) and headless Chrome (scraper). The block is decided by the HTTP client's TLS/HTTP fingerprint (JA3/JA4), not the URL/method/UA — proven on datacom.com (same egress IP + URL: @adobe/fetch → 200, undici → 403, headless → 403).

Change

New helper detectBotBlockerMultiClient probes with both @adobe/fetch and undici (same UA, so the only variable is the client). Returns the existing shape (crawlable/type/userAgent/ipsToAllowlist) with crawlable now the aggregate (false if any client is blocked), plus a perClient breakdown. Injectable deps for hermetic tests.
Wired into the controller, the PLG onboarding flow (so a fingerprint-blocked site is correctly moved to WAITING_FOR_IP_ALLOWLISTING), and the Slack command (per-client output).
headless Chrome is not probed here (no browser in api-service); the scraper-backed headless confirmation is a follow-up that depends on the scraper classifying 4xx (separate PR in spacecat-content-scraper).

Test plan

Helper unit tests; controller (13); Slack command (22, incl. per-client render); PLG onboarding suite (392) — all green, hermetic.
Lint clean.
CI green (coverage on Node 24).
Stage-verified via @spacecat-dev detect bot-blocker datacom.com (expect undici blocked).

🤖 Generated with Claude Code

detect-bot-blocker only probed with the @adobe/fetch client, which Cloudflare Bot Management happens to allow — so it reported "Crawlable: Yes" while the clients that do the real work (undici for CWV liveness / preflight / site-detection / imports, and headless Chrome for the scraper) were 403'd. The block is decided by the HTTP client's TLS/HTTP fingerprint, not the URL/method/UA (proven on datacom.com). - New helper `detectBotBlockerMultiClient`: probes with BOTH @adobe/fetch (shared detectBotBlocker) and Node's native fetch (undici), using the same UA so the only difference is the client. Returns the existing shape (crawlable/type/userAgent/ ipsToAllowlist) — `crawlable` is now the aggregate (false if ANY client is blocked) — plus a `perClient` breakdown. Deps are injectable for hermetic tests. - Wired into the controller, the PLG onboarding flow (so onboarding correctly waitlists a fingerprint-blocked site), and the Slack command (per-client output). - headless Chrome is not probed here (no browser in api-service); the scraper-backed headless confirmation is a follow-up (depends on the scraper classifying 4xx). Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-25T22:40:10Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

MysticatBot

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Request changes - the accidental node_modules symlink must be removed, and there are correctness issues in the aggregation logic.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing to detect Cloudflare TLS fingerprint-based blocks, with per-client breakdown in the API response, Slack output, and PLG onboarding flow (10 files).

Must fix before merge

[Critical] Accidental node_modules symlink committed (local macOS path) - node_modules:1 (details inline)
[Important] ...adobe spread leaks stale reason field when blocker switches to undici - src/support/bot-blocker-multi-client.js:111 (details inline)
[Important] detectBotBlockerFn rejection propagates unhandled while undici failures are caught gracefully - src/support/bot-blocker-multi-client.js:91 (details inline)
[Important] Blocker-preference logic (both clients blocked) has no test coverage - src/support/bot-blocker-multi-client.js:105 (details inline)

Non-blocking (3): minor issues and suggestions

nit: headers always forwarded as {} to detectBotBlockerFn in onboarding flow where it was previously absent - verify shared-utils treats {} same as undefined - src/controllers/plg/plg-onboarding/onboarding-flow.js:601
nit: test fakes use Map for response headers instead of Headers instance; documents implicit iteration contract - test/support/bot-blocker-multi-client.test.js:27
suggestion: pre-existing SSRF surface in Slack command (isValidUrl checks scheme only, no isPublicHostname guard) - consider hardening while touching this file - src/support/slack/commands/detect-bot-blocker.js:79

MysticatBot · 2026-06-25T23:01:22Z

@@ -0,0 +1 @@
+/Users/habansal/code/spacecat/spacecat-api-service/node_modules


issue (blocking): This symlink points to a developer-local macOS path. It will break CI and every other checkout (dangling symlink). It also bypasses the .gitignore directory rule because it is mode 120000 (symlink file, not a directory).

Fix: git rm --cached node_modules and add a bare node_modules line (without trailing /) to .gitignore to prevent recurrence.

MysticatBot · 2026-06-25T23:01:28Z

+  }
+
+  return {
+    ...adobe,


issue (blocking): The ...adobe spread places all of adobe fields (including reason) on the return object. The conditional spread on line 115 only adds blocker.reason when truthy - but if blocker is undici and undici has no reason, the adobe reason from the initial spread survives unclobbered. This means the output carries a reason from the wrong client when only undici is blocked.

Scenario: adobe returns { crawlable: true, reason: 'informational', ... }, undici returns { crawlable: false, type: 'cloudflare', reason: undefined }. Output: { crawlable: false, type: 'cloudflare', reason: 'informational' } - the reason describes the non-blocking client.

Fix: Add reason: blocker.reason || undefined as an explicit field (replacing the conditional spread), or delete the ...adobe spread and explicitly list only the fields you intend to forward.

MysticatBot · 2026-06-25T23:01:35Z

+  { log = console, detectBotBlockerFn = detectBotBlocker, fetchFn = fetch } = {},
+) {
+  const [adobe, undici] = await Promise.all([
+    detectBotBlockerFn({ baseUrl, headers }),


issue (blocking): detectBotBlockerFn has no error wrapper. If it throws (network timeout, DNS failure), Promise.all rejects and the entire function throws - while probeWithUndici gracefully returns an inconclusive result on the same class of failures. The stated design intent is 'inconclusive on failure', but only one probe gets that treatment.

Fix: Wrap the call with a .catch() that returns the same inconclusive shape { crawlable: true, type: 'unknown', confidence: 0.3 } as the undici probe uses. Or switch to Promise.allSettled and handle rejected entries.

MysticatBot · 2026-06-25T23:01:44Z

+  // Surface the blocking client's classification at the top level. Prefer the
+  // @adobe/fetch block (it carries allowlist IPs/UA from the shared probe); fall back
+  // to the undici block when @adobe/fetch was allowed but undici was not.
+  let blocker = adobe;


issue (blocking): The blocker-preference logic (let blocker = adobe; if adobe.crawlable and not undici.crawlable then blocker = undici) has a third branch: both blocked. When both are blocked, blocker stays as adobe (intentional per the comment about allowlist IPs). This branch has no test - the next developer who touches this conditional could break the preference without any test failing.

Fix: Add a test where both stubs return crawlable: false with different types and assert that result.type matches the adobe probe type.

…son leak, node_modules - Remove an accidentally-committed `node_modules` symlink (dev-local path that would dangle in CI/other checkouts) and tighten .gitignore (`node_modules`, no slash) so a symlink-file can't slip past the directory-only rule again. - Wrap the @adobe/fetch probe so a failure (timeout/DNS) is inconclusive — matching the undici probe — instead of rejecting the whole call. - Always set `reason` from the blocking client (overriding the spread) so the reason never describes the non-blocking client. - Add tests: both-clients-blocked (asserts @adobe/fetch type is preferred) and @adobe/fetch probe-failure-is-inconclusive. Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-26T01:36:22Z

This PR will trigger a minor release when merged.

MysticatBot

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Request changes - two material issues remain after the prior findings were addressed.
Complexity: MEDIUM - medium diff; API surface.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (10 files).
Note: CI build is still pending.

Must fix before merge

[Important] Unbounded response body read in undici probe - src/support/bot-blocker-multi-client.js:50 (details inline)
[Important] OpenAPI schema not updated for new perClient field in API response - docs/openapi/bot-blocker-api.yaml (details inline)

Non-blocking (4): minor issues and suggestions

suggestion: Add a regression test for reason override when blocker switches from adobe to undici (adobe has a reason, undici has none, assert result.reason is undefined) - test/support/bot-blocker-multi-client.test.js
nit: Object.fromEntries(response.headers) drops multi-value headers (only last survives); fine for current detection headers but worth a brief comment - src/support/bot-blocker-multi-client.js:30
nit: The ...adobe spread leaks fields like statusCode from the non-blocking probe when blocker is undici; consider an explicit allowlist of forwarded fields - src/support/bot-blocker-multi-client.js:68
nit: PLG esmock factory mock does not return perClient in its response shape, masking the contract difference from the real function - test/controllers/plg/plg-onboarding/plg-esmock-factory.js:66

Previously flagged, now resolved

node_modules symlink removed, .gitignore fixed to match both files and directories
Stale reason field from ...adobe spread now explicitly overridden with reason: blocker.reason || undefined
detectBotBlockerFn rejection now wrapped with .catch() returning inconclusive shape
Both-blocked branch now has dedicated test asserting @adobe/fetch type is preferred

MysticatBot · 2026-06-26T01:46:25Z

+    try {
+      html = await response.text();
+    } catch {
+      html = '';


issue (blocking): response.text() reads the entire response body into memory with no size limit. The AbortSignal.timeout(PROBE_TIMEOUT_MS) covers the connection/headers phase but does not reliably abort body streaming in all Node versions (Node 18 does not propagate the signal to the body reader). A malicious or misconfigured target returning a chunked multi-GB response will spike Lambda memory.

The shared-utils detectBotBlocker caps body reads to 64 KB. This probe should match that contract.

Fix: Add a size guard before consuming the body, e.g. check Content-Length and use a Promise.race with a body-read timeout (3s), or use a streaming reader with a byte cap. The outer try/catch prevents caller-visible failures, but the memory spike is the concern.

…read) Review follow-up: response.text() was unbounded — AbortSignal.timeout covers the connection/headers phase but does not reliably abort body streaming. Mirror the shared detectBotBlocker: skip the body when Content-Length exceeds 64 KB, and race the read against a 3s timeout so a slow or unbounded chunked response can neither hang nor balloon memory. On skip/timeout the verdict falls back to status+headers. Adds tests for the oversized-skip and hanging-read (fake-timer) branches. Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MysticatBot

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Request changes - the OpenAPI schema gap from the prior review remains unresolved.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (10 files).
Note: CI build is still pending.
Note: Recommend a human read before merge - possible divergence from the documented design (heuristic, flagged by the conventions reviewer).

Must fix before merge

[Important] OpenAPI schema not updated for new perClient response field - docs/openapi/schemas.yaml (BotBlockerResult). The API response from src/support/bot-blocker-multi-client.js:131 now includes a perClient object, but the spec does not document it. The project CLAUDE.md requires specification sync ("Keep OpenAPI specs and implementation in sync"). Add the perClient property to BotBlockerResult, then run npm run docs:lint and npm run docs:build.

Non-blocking (4): minor issues and suggestions

nit: Content-Length absent (chunked responses) defaults to 0, passing the size guard and reading the body bounded only by the 3s timeout, not by bytes - src/support/bot-blocker-multi-client.js:56
nit: ...adobe spread implicitly forwards all fields from the adobe probe; an explicit allowlist (userAgent, ipsToAllowlist) would prevent future fields from leaking with wrong-client provenance - src/support/bot-blocker-multi-client.js:131
suggestion: Add a test where adobe has a reason and undici is the blocker with no reason, asserting result.reason is undefined - test/support/bot-blocker-multi-client.test.js
nit: PLG esmock factory mock returns old shape (no perClient); no current consumer reads it in onboarding, but the shape divergence may mask a future regression - test/controllers/plg/plg-onboarding/plg-esmock-factory.js:66

Previously flagged, now resolved

Unbounded response body read now bounded with Content-Length check (64 KB) and Promise.race timeout (3s)

_{Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 6s | Cost: $6.60 | Commit: 30c7dab0ec7df8e78f8eb63110a07fb1d0d1ebda
If this code review was useful, please react with 👍. Otherwise, react with 👎.}

…view nits Round-3 review follow-up: - [must-fix] Document the new `perClient` field (and the forwarded `userAgent` / `ipsToAllowlist`) on `BotBlockerResult` in the OpenAPI schema; rebuild docs. - Forward only the fields consumers use (explicit, no blanket `...adobe` spread) so a future probe field can't leak with wrong-client provenance. - Only read the response body when a bounded Content-Length is advertised; chunked / absent Content-Length is skipped so an unbounded chunked body can't balloon memory. - Test: reason is not leaked from the @adobe/fetch probe when only undici blocks. - Test: chunked-skip (body not read) + Content-Length cap branches. - PLG esmock factory mock now returns the production `perClient` shape. Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MysticatBot

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Approve - all prior blocking findings addressed; clean re-review.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (12 files).
Note: CI build is still pending.

Non-blocking (3): minor issues and suggestions

suggestion: probeWithUndici has no isNonPublicHostname guard, unlike the shared detectBotBlocker. Since Promise.all fires both probes in parallel, a private/internal URL supplied via the Slack command reaches undici unchecked even though the shared probe rejects it. Consider importing isNonPublicHostname from @adobe/spacecat-shared-utils and returning the inconclusive shape early for private hostnames - src/support/bot-blocker-multi-client.js:43
nit: A spoofed Content-Length (e.g. server declares 1000 bytes then streams far more) would let response.text() read beyond 64 KB until the 3s body-read timeout fires. The timeout bounds exposure time but not bytes. A streaming reader with a byte counter would be a tighter cap - src/support/bot-blocker-multi-client.js:57
nit: PLG onboarding still applies || botBlockerResult.ipsToWhitelist fallback, but detectBotBlockerMultiClient already normalizes to ipsToAllowlist (the legacy field is never forwarded). The fallback is dead code - src/controllers/plg/plg-onboarding/onboarding-flow.js:614

Previously flagged, now resolved

OpenAPI schema updated with perClient, userAgent, ipsToAllowlist fields matching the runtime response shape
Body-read guard tightened to skip chunked responses (absent Content-Length parsed as 0 now fails the > 0 check)
...adobe spread replaced with explicit field allowlist preventing wrong-client field leakage
PLG esmock factory mirrors production shape including perClient
Reason-leak test confirms result.reason is undefined when only undici is blocked

_{Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 16s | Cost: $5.58 | Commit: a4420cc75ffe5378de334a3a84301a522a0a8aec
If this code review was useful, please react with 👍. Otherwise, react with 👎.}

habansal temporarily deployed to dev-branches June 25, 2026 22:45 — with GitHub Actions Inactive

habansal marked this pull request as ready for review June 25, 2026 22:48

habansal requested a review from MysticatBot June 25, 2026 22:48

MysticatBot requested changes Jun 25, 2026

View reviewed changes

MysticatBot reviewed Jun 25, 2026

View reviewed changes

MysticatBot added ai-reviewed Reviewed by AI complexity:high AI-assessed PR complexity: HIGH labels Jun 25, 2026

habansal requested a review from MysticatBot June 26, 2026 01:36

MysticatBot requested changes Jun 26, 2026

View reviewed changes

MysticatBot reviewed Jun 26, 2026

View reviewed changes

MysticatBot added complexity:medium AI-assessed PR complexity: MEDIUM and removed complexity:high AI-assessed PR complexity: HIGH labels Jun 26, 2026

habansal temporarily deployed to dev-branches June 26, 2026 01:47 — with GitHub Actions Inactive

habansal requested a review from MysticatBot June 26, 2026 02:01

habansal temporarily deployed to dev-branches June 26, 2026 02:11 — with GitHub Actions Inactive

MysticatBot requested changes Jun 26, 2026

View reviewed changes

MysticatBot added needs-human-review AI reviewer recommends a human read before merge complexity:high AI-assessed PR complexity: HIGH and removed complexity:medium AI-assessed PR complexity: MEDIUM labels Jun 26, 2026

habansal requested a review from MysticatBot June 26, 2026 02:31

habansal temporarily deployed to dev-branches June 26, 2026 02:42 — with GitHub Actions Inactive

MysticatBot approved these changes Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(bot-blocker): probe multiple HTTP clients with a per-client verdict#2696

feat(bot-blocker): probe multiple HTTP clients with a per-client verdict#2696
habansal wants to merge 4 commits into
mainfrom
fix/bot-blocker-multi-client

habansal commented Jun 25, 2026

Uh oh!

codecov Bot commented Jun 25, 2026

Uh oh!

MysticatBot left a comment

Uh oh!

MysticatBot Jun 25, 2026

Uh oh!

MysticatBot Jun 25, 2026

Uh oh!

MysticatBot Jun 25, 2026

Uh oh!

MysticatBot Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

MysticatBot left a comment

Uh oh!

MysticatBot Jun 26, 2026

Uh oh!

MysticatBot left a comment •

edited

Loading

Uh oh!

MysticatBot left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1 @@
		/Users/habansal/code/spacecat/spacecat-api-service/node_modules No newline at end of file

Uh oh!

Conversation

habansal commented Jun 25, 2026

What & why

Change

Test plan

Uh oh!

codecov Bot commented Jun 25, 2026

Codecov Report

Uh oh!

MysticatBot left a comment

Choose a reason for hiding this comment

Must fix before merge

Uh oh!

MysticatBot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

MysticatBot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

MysticatBot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

MysticatBot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

MysticatBot left a comment

Choose a reason for hiding this comment

Must fix before merge

Previously flagged, now resolved

Uh oh!

MysticatBot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

MysticatBot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Must fix before merge

Previously flagged, now resolved

Uh oh!

MysticatBot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Previously flagged, now resolved

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MysticatBot left a comment •

edited

Loading

MysticatBot left a comment •

edited

Loading