Skip to content

feat(bot-blocker): probe multiple HTTP clients with a per-client verdict#2696

Open
habansal wants to merge 4 commits into
mainfrom
fix/bot-blocker-multi-client
Open

feat(bot-blocker): probe multiple HTTP clients with a per-client verdict#2696
habansal wants to merge 4 commits into
mainfrom
fix/bot-blocker-multi-client

Conversation

@habansal

Copy link
Copy Markdown
Contributor

What & why

Implements the api-service part of SITES-47217 (datacom.com investigation, SITES-47216).

detect bot-blocker only probed with the @adobe/fetch client — the one Cloudflare Bot Management happens to allow — so it reported "Crawlable: Yes" while the clients that do the real work were 403'd: undici (Node fetch: CWV liveness, preflight, site-detection, imports) and headless Chrome (scraper). The block is decided by the HTTP client's TLS/HTTP fingerprint (JA3/JA4), not the URL/method/UA — proven on datacom.com (same egress IP + URL: @adobe/fetch → 200, undici → 403, headless → 403).

Change

  • New helper detectBotBlockerMultiClient probes with both @adobe/fetch and undici (same UA, so the only variable is the client). Returns the existing shape (crawlable/type/userAgent/ipsToAllowlist) with crawlable now the aggregate (false if any client is blocked), plus a perClient breakdown. Injectable deps for hermetic tests.
  • Wired into the controller, the PLG onboarding flow (so a fingerprint-blocked site is correctly moved to WAITING_FOR_IP_ALLOWLISTING), and the Slack command (per-client output).
  • headless Chrome is not probed here (no browser in api-service); the scraper-backed headless confirmation is a follow-up that depends on the scraper classifying 4xx (separate PR in spacecat-content-scraper).

Test plan

  • Helper unit tests; controller (13); Slack command (22, incl. per-client render); PLG onboarding suite (392) — all green, hermetic.
  • Lint clean.
  • CI green (coverage on Node 24).
  • Stage-verified via @spacecat-dev detect bot-blocker datacom.com (expect undici blocked).

🤖 Generated with Claude Code

detect-bot-blocker only probed with the @adobe/fetch client, which Cloudflare Bot
Management happens to allow — so it reported "Crawlable: Yes" while the clients that
do the real work (undici for CWV liveness / preflight / site-detection / imports, and
headless Chrome for the scraper) were 403'd. The block is decided by the HTTP client's
TLS/HTTP fingerprint, not the URL/method/UA (proven on datacom.com).

- New helper `detectBotBlockerMultiClient`: probes with BOTH @adobe/fetch (shared
  detectBotBlocker) and Node's native fetch (undici), using the same UA so the only
  difference is the client. Returns the existing shape (crawlable/type/userAgent/
  ipsToAllowlist) — `crawlable` is now the aggregate (false if ANY client is blocked) —
  plus a `perClient` breakdown. Deps are injectable for hermetic tests.
- Wired into the controller, the PLG onboarding flow (so onboarding correctly
  waitlists a fingerprint-blocked site), and the Slack command (per-client output).
- headless Chrome is not probed here (no browser in api-service); the scraper-backed
  headless confirmation is a follow-up (depends on the scraper classifying 4xx).

Refs SITES-47217.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@habansal habansal marked this pull request as ready for review June 25, 2026 22:48
@habansal habansal requested a review from MysticatBot June 25, 2026 22:48

@MysticatBot MysticatBot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Request changes - the accidental node_modules symlink must be removed, and there are correctness issues in the aggregation logic.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing to detect Cloudflare TLS fingerprint-based blocks, with per-client breakdown in the API response, Slack output, and PLG onboarding flow (10 files).

Must fix before merge

  1. [Critical] Accidental node_modules symlink committed (local macOS path) - node_modules:1 (details inline)
  2. [Important] ...adobe spread leaks stale reason field when blocker switches to undici - src/support/bot-blocker-multi-client.js:111 (details inline)
  3. [Important] detectBotBlockerFn rejection propagates unhandled while undici failures are caught gracefully - src/support/bot-blocker-multi-client.js:91 (details inline)
  4. [Important] Blocker-preference logic (both clients blocked) has no test coverage - src/support/bot-blocker-multi-client.js:105 (details inline)
Non-blocking (3): minor issues and suggestions
  • nit: headers always forwarded as {} to detectBotBlockerFn in onboarding flow where it was previously absent - verify shared-utils treats {} same as undefined - src/controllers/plg/plg-onboarding/onboarding-flow.js:601
  • nit: test fakes use Map for response headers instead of Headers instance; documents implicit iteration contract - test/support/bot-blocker-multi-client.test.js:27
  • suggestion: pre-existing SSRF surface in Slack command (isValidUrl checks scheme only, no isPublicHostname guard) - consider hardening while touching this file - src/support/slack/commands/detect-bot-blocker.js:79

Comment thread node_modules Outdated
@@ -0,0 +1 @@
/Users/habansal/code/spacecat/spacecat-api-service/node_modules No newline at end of file

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): This symlink points to a developer-local macOS path. It will break CI and every other checkout (dangling symlink). It also bypasses the .gitignore directory rule because it is mode 120000 (symlink file, not a directory).

Fix: git rm --cached node_modules and add a bare node_modules line (without trailing /) to .gitignore to prevent recurrence.

Comment thread src/support/bot-blocker-multi-client.js Outdated
}

return {
...adobe,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): The ...adobe spread places all of adobe fields (including reason) on the return object. The conditional spread on line 115 only adds blocker.reason when truthy - but if blocker is undici and undici has no reason, the adobe reason from the initial spread survives unclobbered. This means the output carries a reason from the wrong client when only undici is blocked.

Scenario: adobe returns { crawlable: true, reason: 'informational', ... }, undici returns { crawlable: false, type: 'cloudflare', reason: undefined }. Output: { crawlable: false, type: 'cloudflare', reason: 'informational' } - the reason describes the non-blocking client.

Fix: Add reason: blocker.reason || undefined as an explicit field (replacing the conditional spread), or delete the ...adobe spread and explicitly list only the fields you intend to forward.

Comment thread src/support/bot-blocker-multi-client.js Outdated
{ log = console, detectBotBlockerFn = detectBotBlocker, fetchFn = fetch } = {},
) {
const [adobe, undici] = await Promise.all([
detectBotBlockerFn({ baseUrl, headers }),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): detectBotBlockerFn has no error wrapper. If it throws (network timeout, DNS failure), Promise.all rejects and the entire function throws - while probeWithUndici gracefully returns an inconclusive result on the same class of failures. The stated design intent is 'inconclusive on failure', but only one probe gets that treatment.

Fix: Wrap the call with a .catch() that returns the same inconclusive shape { crawlable: true, type: 'unknown', confidence: 0.3 } as the undici probe uses. Or switch to Promise.allSettled and handle rejected entries.

// Surface the blocking client's classification at the top level. Prefer the
// @adobe/fetch block (it carries allowlist IPs/UA from the shared probe); fall back
// to the undici block when @adobe/fetch was allowed but undici was not.
let blocker = adobe;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): The blocker-preference logic (let blocker = adobe; if adobe.crawlable and not undici.crawlable then blocker = undici) has a third branch: both blocked. When both are blocked, blocker stays as adobe (intentional per the comment about allowlist IPs). This branch has no test - the next developer who touches this conditional could break the preference without any test failing.

Fix: Add a test where both stubs return crawlable: false with different types and assert that result.type matches the adobe probe type.

@MysticatBot MysticatBot added ai-reviewed Reviewed by AI complexity:high AI-assessed PR complexity: HIGH labels Jun 25, 2026
…son leak, node_modules

- Remove an accidentally-committed `node_modules` symlink (dev-local path that would
  dangle in CI/other checkouts) and tighten .gitignore (`node_modules`, no slash) so a
  symlink-file can't slip past the directory-only rule again.
- Wrap the @adobe/fetch probe so a failure (timeout/DNS) is inconclusive — matching the
  undici probe — instead of rejecting the whole call.
- Always set `reason` from the blocking client (overriding the spread) so the reason
  never describes the non-blocking client.
- Add tests: both-clients-blocked (asserts @adobe/fetch type is preferred) and
  @adobe/fetch probe-failure-is-inconclusive.

Refs SITES-47217.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

This PR will trigger a minor release when merged.

@habansal habansal requested a review from MysticatBot June 26, 2026 01:36

@MysticatBot MysticatBot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Request changes - two material issues remain after the prior findings were addressed.
Complexity: MEDIUM - medium diff; API surface.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (10 files).
Note: CI build is still pending.

Must fix before merge

  1. [Important] Unbounded response body read in undici probe - src/support/bot-blocker-multi-client.js:50 (details inline)
  2. [Important] OpenAPI schema not updated for new perClient field in API response - docs/openapi/bot-blocker-api.yaml (details inline)
Non-blocking (4): minor issues and suggestions
  • suggestion: Add a regression test for reason override when blocker switches from adobe to undici (adobe has a reason, undici has none, assert result.reason is undefined) - test/support/bot-blocker-multi-client.test.js
  • nit: Object.fromEntries(response.headers) drops multi-value headers (only last survives); fine for current detection headers but worth a brief comment - src/support/bot-blocker-multi-client.js:30
  • nit: The ...adobe spread leaks fields like statusCode from the non-blocking probe when blocker is undici; consider an explicit allowlist of forwarded fields - src/support/bot-blocker-multi-client.js:68
  • nit: PLG esmock factory mock does not return perClient in its response shape, masking the contract difference from the real function - test/controllers/plg/plg-onboarding/plg-esmock-factory.js:66

Previously flagged, now resolved

  • node_modules symlink removed, .gitignore fixed to match both files and directories
  • Stale reason field from ...adobe spread now explicitly overridden with reason: blocker.reason || undefined
  • detectBotBlockerFn rejection now wrapped with .catch() returning inconclusive shape
  • Both-blocked branch now has dedicated test asserting @adobe/fetch type is preferred

Comment thread src/support/bot-blocker-multi-client.js Outdated
try {
html = await response.text();
} catch {
html = '';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): response.text() reads the entire response body into memory with no size limit. The AbortSignal.timeout(PROBE_TIMEOUT_MS) covers the connection/headers phase but does not reliably abort body streaming in all Node versions (Node 18 does not propagate the signal to the body reader). A malicious or misconfigured target returning a chunked multi-GB response will spike Lambda memory.

The shared-utils detectBotBlocker caps body reads to 64 KB. This probe should match that contract.

Fix: Add a size guard before consuming the body, e.g. check Content-Length and use a Promise.race with a body-read timeout (3s), or use a streaming reader with a byte cap. The outer try/catch prevents caller-visible failures, but the memory spike is the concern.

@MysticatBot MysticatBot added complexity:medium AI-assessed PR complexity: MEDIUM and removed complexity:high AI-assessed PR complexity: HIGH labels Jun 26, 2026
…read)

Review follow-up: response.text() was unbounded — AbortSignal.timeout covers the
connection/headers phase but does not reliably abort body streaming. Mirror the shared
detectBotBlocker: skip the body when Content-Length exceeds 64 KB, and race the read
against a 3s timeout so a slow or unbounded chunked response can neither hang nor
balloon memory. On skip/timeout the verdict falls back to status+headers.

Adds tests for the oversized-skip and hanging-read (fake-timer) branches.

Refs SITES-47217.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@habansal habansal requested a review from MysticatBot June 26, 2026 02:01

@MysticatBot MysticatBot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Request changes - the OpenAPI schema gap from the prior review remains unresolved.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (10 files).
Note: CI build is still pending.
Note: Recommend a human read before merge - possible divergence from the documented design (heuristic, flagged by the conventions reviewer).

Must fix before merge

  1. [Important] OpenAPI schema not updated for new perClient response field - docs/openapi/schemas.yaml (BotBlockerResult). The API response from src/support/bot-blocker-multi-client.js:131 now includes a perClient object, but the spec does not document it. The project CLAUDE.md requires specification sync ("Keep OpenAPI specs and implementation in sync"). Add the perClient property to BotBlockerResult, then run npm run docs:lint and npm run docs:build.
Non-blocking (4): minor issues and suggestions
  • nit: Content-Length absent (chunked responses) defaults to 0, passing the size guard and reading the body bounded only by the 3s timeout, not by bytes - src/support/bot-blocker-multi-client.js:56
  • nit: ...adobe spread implicitly forwards all fields from the adobe probe; an explicit allowlist (userAgent, ipsToAllowlist) would prevent future fields from leaking with wrong-client provenance - src/support/bot-blocker-multi-client.js:131
  • suggestion: Add a test where adobe has a reason and undici is the blocker with no reason, asserting result.reason is undefined - test/support/bot-blocker-multi-client.test.js
  • nit: PLG esmock factory mock returns old shape (no perClient); no current consumer reads it in onboarding, but the shape divergence may mask a future regression - test/controllers/plg/plg-onboarding/plg-esmock-factory.js:66

Previously flagged, now resolved

  • Unbounded response body read now bounded with Content-Length check (64 KB) and Promise.race timeout (3s)

Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 6s | Cost: $6.60 | Commit: 30c7dab0ec7df8e78f8eb63110a07fb1d0d1ebda
If this code review was useful, please react with 👍. Otherwise, react with 👎.

@MysticatBot MysticatBot added needs-human-review AI reviewer recommends a human read before merge complexity:high AI-assessed PR complexity: HIGH and removed complexity:medium AI-assessed PR complexity: MEDIUM labels Jun 26, 2026
…view nits

Round-3 review follow-up:
- [must-fix] Document the new `perClient` field (and the forwarded `userAgent` /
  `ipsToAllowlist`) on `BotBlockerResult` in the OpenAPI schema; rebuild docs.
- Forward only the fields consumers use (explicit, no blanket `...adobe` spread) so a
  future probe field can't leak with wrong-client provenance.
- Only read the response body when a bounded Content-Length is advertised; chunked /
  absent Content-Length is skipped so an unbounded chunked body can't balloon memory.
- Test: reason is not leaked from the @adobe/fetch probe when only undici blocks.
- Test: chunked-skip (body not read) + Content-Length cap branches.
- PLG esmock factory mock now returns the production `perClient` shape.

Refs SITES-47217.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@habansal habansal requested a review from MysticatBot June 26, 2026 02:31

@MysticatBot MysticatBot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @habansal,

⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.

Verdict: Approve - all prior blocking findings addressed; clean re-review.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (12 files).
Note: CI build is still pending.

Non-blocking (3): minor issues and suggestions
  • suggestion: probeWithUndici has no isNonPublicHostname guard, unlike the shared detectBotBlocker. Since Promise.all fires both probes in parallel, a private/internal URL supplied via the Slack command reaches undici unchecked even though the shared probe rejects it. Consider importing isNonPublicHostname from @adobe/spacecat-shared-utils and returning the inconclusive shape early for private hostnames - src/support/bot-blocker-multi-client.js:43
  • nit: A spoofed Content-Length (e.g. server declares 1000 bytes then streams far more) would let response.text() read beyond 64 KB until the 3s body-read timeout fires. The timeout bounds exposure time but not bytes. A streaming reader with a byte counter would be a tighter cap - src/support/bot-blocker-multi-client.js:57
  • nit: PLG onboarding still applies || botBlockerResult.ipsToWhitelist fallback, but detectBotBlockerMultiClient already normalizes to ipsToAllowlist (the legacy field is never forwarded). The fallback is dead code - src/controllers/plg/plg-onboarding/onboarding-flow.js:614

Previously flagged, now resolved

  • OpenAPI schema updated with perClient, userAgent, ipsToAllowlist fields matching the runtime response shape
  • Body-read guard tightened to skip chunked responses (absent Content-Length parsed as 0 now fails the > 0 check)
  • ...adobe spread replaced with explicit field allowlist preventing wrong-client field leakage
  • PLG esmock factory mirrors production shape including perClient
  • Reason-leak test confirms result.reason is undefined when only undici is blocked

Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 16s | Cost: $5.58 | Commit: a4420cc75ffe5378de334a3a84301a522a0a8aec
If this code review was useful, please react with 👍. Otherwise, react with 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-reviewed Reviewed by AI complexity:high AI-assessed PR complexity: HIGH needs-human-review AI reviewer recommends a human read before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants