feat(bot-blocker): probe multiple HTTP clients with a per-client verdict#2696
feat(bot-blocker): probe multiple HTTP clients with a per-client verdict#2696habansal wants to merge 4 commits into
Conversation
detect-bot-blocker only probed with the @adobe/fetch client, which Cloudflare Bot Management happens to allow — so it reported "Crawlable: Yes" while the clients that do the real work (undici for CWV liveness / preflight / site-detection / imports, and headless Chrome for the scraper) were 403'd. The block is decided by the HTTP client's TLS/HTTP fingerprint, not the URL/method/UA (proven on datacom.com). - New helper `detectBotBlockerMultiClient`: probes with BOTH @adobe/fetch (shared detectBotBlocker) and Node's native fetch (undici), using the same UA so the only difference is the client. Returns the existing shape (crawlable/type/userAgent/ ipsToAllowlist) — `crawlable` is now the aggregate (false if ANY client is blocked) — plus a `perClient` breakdown. Deps are injectable for hermetic tests. - Wired into the controller, the PLG onboarding flow (so onboarding correctly waitlists a fingerprint-blocked site), and the Slack command (per-client output). - headless Chrome is not probed here (no browser in api-service); the scraper-backed headless confirmation is a follow-up (depends on the scraper classifying 4xx). Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
MysticatBot
left a comment
There was a problem hiding this comment.
Hey @habansal,
⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.
Verdict: Request changes - the accidental node_modules symlink must be removed, and there are correctness issues in the aggregation logic.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing to detect Cloudflare TLS fingerprint-based blocks, with per-client breakdown in the API response, Slack output, and PLG onboarding flow (10 files).
Must fix before merge
- [Critical] Accidental
node_modulessymlink committed (local macOS path) -node_modules:1(details inline) - [Important]
...adobespread leaks stalereasonfield when blocker switches to undici -src/support/bot-blocker-multi-client.js:111(details inline) - [Important]
detectBotBlockerFnrejection propagates unhandled while undici failures are caught gracefully -src/support/bot-blocker-multi-client.js:91(details inline) - [Important] Blocker-preference logic (both clients blocked) has no test coverage -
src/support/bot-blocker-multi-client.js:105(details inline)
Non-blocking (3): minor issues and suggestions
- nit:
headersalways forwarded as{}todetectBotBlockerFnin onboarding flow where it was previously absent - verify shared-utils treats{}same asundefined-src/controllers/plg/plg-onboarding/onboarding-flow.js:601 - nit: test fakes use
Mapfor response headers instead ofHeadersinstance; documents implicit iteration contract -test/support/bot-blocker-multi-client.test.js:27 - suggestion: pre-existing SSRF surface in Slack command (
isValidUrlchecks scheme only, noisPublicHostnameguard) - consider hardening while touching this file -src/support/slack/commands/detect-bot-blocker.js:79
| @@ -0,0 +1 @@ | |||
| /Users/habansal/code/spacecat/spacecat-api-service/node_modules No newline at end of file | |||
There was a problem hiding this comment.
issue (blocking): This symlink points to a developer-local macOS path. It will break CI and every other checkout (dangling symlink). It also bypasses the .gitignore directory rule because it is mode 120000 (symlink file, not a directory).
Fix: git rm --cached node_modules and add a bare node_modules line (without trailing /) to .gitignore to prevent recurrence.
| } | ||
|
|
||
| return { | ||
| ...adobe, |
There was a problem hiding this comment.
issue (blocking): The ...adobe spread places all of adobe fields (including reason) on the return object. The conditional spread on line 115 only adds blocker.reason when truthy - but if blocker is undici and undici has no reason, the adobe reason from the initial spread survives unclobbered. This means the output carries a reason from the wrong client when only undici is blocked.
Scenario: adobe returns { crawlable: true, reason: 'informational', ... }, undici returns { crawlable: false, type: 'cloudflare', reason: undefined }. Output: { crawlable: false, type: 'cloudflare', reason: 'informational' } - the reason describes the non-blocking client.
Fix: Add reason: blocker.reason || undefined as an explicit field (replacing the conditional spread), or delete the ...adobe spread and explicitly list only the fields you intend to forward.
| { log = console, detectBotBlockerFn = detectBotBlocker, fetchFn = fetch } = {}, | ||
| ) { | ||
| const [adobe, undici] = await Promise.all([ | ||
| detectBotBlockerFn({ baseUrl, headers }), |
There was a problem hiding this comment.
issue (blocking): detectBotBlockerFn has no error wrapper. If it throws (network timeout, DNS failure), Promise.all rejects and the entire function throws - while probeWithUndici gracefully returns an inconclusive result on the same class of failures. The stated design intent is 'inconclusive on failure', but only one probe gets that treatment.
Fix: Wrap the call with a .catch() that returns the same inconclusive shape { crawlable: true, type: 'unknown', confidence: 0.3 } as the undici probe uses. Or switch to Promise.allSettled and handle rejected entries.
| // Surface the blocking client's classification at the top level. Prefer the | ||
| // @adobe/fetch block (it carries allowlist IPs/UA from the shared probe); fall back | ||
| // to the undici block when @adobe/fetch was allowed but undici was not. | ||
| let blocker = adobe; |
There was a problem hiding this comment.
issue (blocking): The blocker-preference logic (let blocker = adobe; if adobe.crawlable and not undici.crawlable then blocker = undici) has a third branch: both blocked. When both are blocked, blocker stays as adobe (intentional per the comment about allowlist IPs). This branch has no test - the next developer who touches this conditional could break the preference without any test failing.
Fix: Add a test where both stubs return crawlable: false with different types and assert that result.type matches the adobe probe type.
…son leak, node_modules - Remove an accidentally-committed `node_modules` symlink (dev-local path that would dangle in CI/other checkouts) and tighten .gitignore (`node_modules`, no slash) so a symlink-file can't slip past the directory-only rule again. - Wrap the @adobe/fetch probe so a failure (timeout/DNS) is inconclusive — matching the undici probe — instead of rejecting the whole call. - Always set `reason` from the blocking client (overriding the spread) so the reason never describes the non-blocking client. - Add tests: both-clients-blocked (asserts @adobe/fetch type is preferred) and @adobe/fetch probe-failure-is-inconclusive. Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
This PR will trigger a minor release when merged. |
MysticatBot
left a comment
There was a problem hiding this comment.
Hey @habansal,
⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.
Verdict: Request changes - two material issues remain after the prior findings were addressed.
Complexity: MEDIUM - medium diff; API surface.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (10 files).
Note: CI build is still pending.
Must fix before merge
- [Important] Unbounded response body read in undici probe -
src/support/bot-blocker-multi-client.js:50(details inline) - [Important] OpenAPI schema not updated for new
perClientfield in API response -docs/openapi/bot-blocker-api.yaml(details inline)
Non-blocking (4): minor issues and suggestions
- suggestion: Add a regression test for
reasonoverride when blocker switches from adobe to undici (adobe has a reason, undici has none, assert result.reason is undefined) -test/support/bot-blocker-multi-client.test.js - nit:
Object.fromEntries(response.headers)drops multi-value headers (only last survives); fine for current detection headers but worth a brief comment -src/support/bot-blocker-multi-client.js:30 - nit: The
...adobespread leaks fields likestatusCodefrom the non-blocking probe when blocker is undici; consider an explicit allowlist of forwarded fields -src/support/bot-blocker-multi-client.js:68 - nit: PLG esmock factory mock does not return
perClientin its response shape, masking the contract difference from the real function -test/controllers/plg/plg-onboarding/plg-esmock-factory.js:66
Previously flagged, now resolved
- node_modules symlink removed, .gitignore fixed to match both files and directories
- Stale
reasonfield from...adobespread now explicitly overridden withreason: blocker.reason || undefined detectBotBlockerFnrejection now wrapped with.catch()returning inconclusive shape- Both-blocked branch now has dedicated test asserting @adobe/fetch type is preferred
| try { | ||
| html = await response.text(); | ||
| } catch { | ||
| html = ''; |
There was a problem hiding this comment.
issue (blocking): response.text() reads the entire response body into memory with no size limit. The AbortSignal.timeout(PROBE_TIMEOUT_MS) covers the connection/headers phase but does not reliably abort body streaming in all Node versions (Node 18 does not propagate the signal to the body reader). A malicious or misconfigured target returning a chunked multi-GB response will spike Lambda memory.
The shared-utils detectBotBlocker caps body reads to 64 KB. This probe should match that contract.
Fix: Add a size guard before consuming the body, e.g. check Content-Length and use a Promise.race with a body-read timeout (3s), or use a streaming reader with a byte cap. The outer try/catch prevents caller-visible failures, but the memory spike is the concern.
…read) Review follow-up: response.text() was unbounded — AbortSignal.timeout covers the connection/headers phase but does not reliably abort body streaming. Mirror the shared detectBotBlocker: skip the body when Content-Length exceeds 64 KB, and race the read against a 3s timeout so a slow or unbounded chunked response can neither hang nor balloon memory. On skip/timeout the verdict falls back to status+headers. Adds tests for the oversized-skip and hanging-read (fake-timer) branches. Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Hey @habansal,
⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.
Verdict: Request changes - the OpenAPI schema gap from the prior review remains unresolved.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (10 files).
Note: CI build is still pending.
Note: Recommend a human read before merge - possible divergence from the documented design (heuristic, flagged by the conventions reviewer).
Must fix before merge
- [Important] OpenAPI schema not updated for new
perClientresponse field -docs/openapi/schemas.yaml(BotBlockerResult). The API response fromsrc/support/bot-blocker-multi-client.js:131now includes aperClientobject, but the spec does not document it. The project CLAUDE.md requires specification sync ("Keep OpenAPI specs and implementation in sync"). Add theperClientproperty toBotBlockerResult, then runnpm run docs:lintandnpm run docs:build.
Non-blocking (4): minor issues and suggestions
- nit: Content-Length absent (chunked responses) defaults to 0, passing the size guard and reading the body bounded only by the 3s timeout, not by bytes -
src/support/bot-blocker-multi-client.js:56 - nit:
...adobespread implicitly forwards all fields from the adobe probe; an explicit allowlist (userAgent,ipsToAllowlist) would prevent future fields from leaking with wrong-client provenance -src/support/bot-blocker-multi-client.js:131 - suggestion: Add a test where adobe has a
reasonand undici is the blocker with no reason, assertingresult.reasonisundefined-test/support/bot-blocker-multi-client.test.js - nit: PLG esmock factory mock returns old shape (no
perClient); no current consumer reads it in onboarding, but the shape divergence may mask a future regression -test/controllers/plg/plg-onboarding/plg-esmock-factory.js:66
Previously flagged, now resolved
- Unbounded response body read now bounded with Content-Length check (64 KB) and Promise.race timeout (3s)
Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 6s | Cost: $6.60 | Commit: 30c7dab0ec7df8e78f8eb63110a07fb1d0d1ebda
If this code review was useful, please react with 👍. Otherwise, react with 👎.
…view nits Round-3 review follow-up: - [must-fix] Document the new `perClient` field (and the forwarded `userAgent` / `ipsToAllowlist`) on `BotBlockerResult` in the OpenAPI schema; rebuild docs. - Forward only the fields consumers use (explicit, no blanket `...adobe` spread) so a future probe field can't leak with wrong-client provenance. - Only read the response body when a bounded Content-Length is advertised; chunked / absent Content-Length is skipped so an unbounded chunked body can't balloon memory. - Test: reason is not leaked from the @adobe/fetch probe when only undici blocks. - Test: chunked-skip (body not read) + Content-Length cap branches. - PLG esmock factory mock now returns the production `perClient` shape. Refs SITES-47217. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Hey @habansal,
⚠ Degraded review - no spec document was found for this change (searched the PR links, the touched repos' docs, the architecture/guidelines docs, and linked Jira). This review covers code-level quality but could not validate the change against an agreed design, so confidence is reduced. Add a spec link (PR template section 4) and re-request review for a full-confidence pass.
Verdict: Approve - all prior blocking findings addressed; clean re-review.
Complexity: HIGH - medium diff; API surface risk.
Changes: Adds multi-client (adobe-fetch + undici) bot-blocker probing with per-client breakdown, wired into controller, PLG onboarding, and Slack command (12 files).
Note: CI build is still pending.
Non-blocking (3): minor issues and suggestions
- suggestion:
probeWithUndicihas noisNonPublicHostnameguard, unlike the shareddetectBotBlocker. SincePromise.allfires both probes in parallel, a private/internal URL supplied via the Slack command reaches undici unchecked even though the shared probe rejects it. Consider importingisNonPublicHostnamefrom@adobe/spacecat-shared-utilsand returning the inconclusive shape early for private hostnames -src/support/bot-blocker-multi-client.js:43 - nit: A spoofed
Content-Length(e.g. server declares 1000 bytes then streams far more) would letresponse.text()read beyond 64 KB until the 3s body-read timeout fires. The timeout bounds exposure time but not bytes. A streaming reader with a byte counter would be a tighter cap -src/support/bot-blocker-multi-client.js:57 - nit: PLG onboarding still applies
|| botBlockerResult.ipsToWhitelistfallback, butdetectBotBlockerMultiClientalready normalizes toipsToAllowlist(the legacy field is never forwarded). The fallback is dead code -src/controllers/plg/plg-onboarding/onboarding-flow.js:614
Previously flagged, now resolved
- OpenAPI schema updated with
perClient,userAgent,ipsToAllowlistfields matching the runtime response shape - Body-read guard tightened to skip chunked responses (absent Content-Length parsed as 0 now fails the
> 0check) ...adobespread replaced with explicit field allowlist preventing wrong-client field leakage- PLG esmock factory mirrors production shape including
perClient - Reason-leak test confirms
result.reasonis undefined when only undici is blocked
Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 16s | Cost: $5.58 | Commit: a4420cc75ffe5378de334a3a84301a522a0a8aec
If this code review was useful, please react with 👍. Otherwise, react with 👎.
What & why
Implements the api-service part of SITES-47217 (datacom.com investigation, SITES-47216).
detect bot-blockeronly probed with the @adobe/fetch client — the one Cloudflare Bot Management happens to allow — so it reported "Crawlable: Yes" while the clients that do the real work were 403'd: undici (Nodefetch: CWV liveness, preflight, site-detection, imports) and headless Chrome (scraper). The block is decided by the HTTP client's TLS/HTTP fingerprint (JA3/JA4), not the URL/method/UA — proven on datacom.com (same egress IP + URL:@adobe/fetch→ 200,undici→ 403, headless → 403).Change
detectBotBlockerMultiClientprobes with both@adobe/fetchandundici(same UA, so the only variable is the client). Returns the existing shape (crawlable/type/userAgent/ipsToAllowlist) withcrawlablenow the aggregate (false if any client is blocked), plus aperClientbreakdown. Injectable deps for hermetic tests.WAITING_FOR_IP_ALLOWLISTING), and the Slack command (per-client output).Test plan
@spacecat-dev detect bot-blocker datacom.com(expect undici blocked).🤖 Generated with Claude Code