feat(bot-blocker): probe multiple HTTP clients with a per-client verdict #2696
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| coverage | ||
| .nyc_output/ | ||
| node_modules/ | ||
| node_modules | ||
| junit | ||
| dist | ||
| tmp | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,141 @@ | ||
| /* | ||
| * Copyright 2026 Adobe. All rights reserved. | ||
| * This file is licensed to you under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. You may obtain a copy | ||
| * of the License at http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software distributed under | ||
| * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS | ||
| * OF ANY KIND, either express or implied. See the License for the specific language | ||
| * governing permissions and limitations under the License. | ||
| */ | ||
|
|
||
| import { | ||
| detectBotBlocker, analyzeBotProtection, SPACECAT_USER_AGENT, | ||
| } from '@adobe/spacecat-shared-utils'; | ||
|
|
||
| const PROBE_TIMEOUT_MS = 10000; | ||
| // Bound the body read the same way the shared detectBotBlocker does: skip the body | ||
| // when Content-Length is large, and race the read against a short timeout so a slow | ||
| // (or unbounded chunked) response can never hang or balloon memory. | ||
| const BODY_READ_MAX_BYTES = 65536; // 64 KB — challenge markers appear in the first KB | ||
| const BODY_READ_TIMEOUT_MS = 3000; | ||
|
|
||
| /** | ||
| * Probes a URL with Node's native fetch (undici) and classifies the response. | ||
| * | ||
| * undici is the HTTP client used by CWV liveness, preflight, site-detection, and the | ||
| * import-worker. Cloudflare Bot Management fingerprints the client (TLS/HTTP, JA3/JA4), | ||
| * so a site can allow the @adobe/fetch client while blocking undici (and headless | ||
| * Chrome). We send the same User-Agent the @adobe/fetch probe uses so the ONLY | ||
| * difference between the two probes is the client itself. | ||
| * | ||
| * A request we cannot complete (timeout/network) is reported as inconclusive | ||
| * (crawlable, low confidence) rather than blocked — we only assert a block when the | ||
| * response actually classifies as one. | ||
| * | ||
| * @param {string} baseUrl - URL to probe. | ||
| * @param {Object} headers - Optional extra headers (e.g. site scraper headers). | ||
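| * @param {Object} log - Logger. | ||
| * @param {Function} fetchFn - fetch-compatible function used to issue the request (the caller defaults it to the global fetch). | ||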
| * @returns {Promise<Object>} analyzeBotProtection result { crawlable, type, confidence }. | ||
| */ | ||
| async function probeWithUndici(baseUrl, headers, log, fetchFn) { | ||
| try { | ||
| const response = await fetchFn(baseUrl, { | ||
| method: 'GET', | ||
| redirect: 'manual', | ||
| headers: { 'User-Agent': SPACECAT_USER_AGENT, ...headers }, | ||
| signal: AbortSignal.timeout(PROBE_TIMEOUT_MS), | ||
| }); | ||
| const headersObj = Object.fromEntries(response.headers); | ||
| let html = ''; | ||
| const contentLength = parseInt(headersObj['content-length'] || '0', 10); | ||
| if (contentLength <= BODY_READ_MAX_BYTES) { | ||
| try { | ||
| let timer; | ||
| html = await Promise.race([ | ||
| response.text().finally(() => clearTimeout(timer)), | ||
| new Promise((_, reject) => { | ||
| timer = setTimeout(() => reject(new Error('body-read-timeout')), BODY_READ_TIMEOUT_MS); | ||
| }), | ||
| ]); | ||
| } catch { | ||
| html = ''; | ||
| } | ||
| } | ||
| return analyzeBotProtection({ status: response.status, headers: headersObj, html }); | ||
| } catch (err) { | ||
| log?.debug?.(`[bot-blocker] undici probe inconclusive for ${baseUrl}: ${err.message}`); | ||
| return { crawlable: true, type: 'unknown', confidence: 0.3 }; | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Multi-client bot-blocker detection. | ||
| * | ||
| * Probes the site with BOTH the @adobe/fetch client (via the shared | ||
| * {@link detectBotBlocker}) and Node's native fetch (undici), because Cloudflare Bot | ||
| * Management blocks on the HTTP client fingerprint: a site can allow @adobe/fetch while | ||
| * blocking undici — the client CWV/preflight/imports use — and headless Chrome (the | ||
| * scraper). A single-client probe therefore yields false "crawlable: Yes" verdicts | ||
| * (SITES-47217 / datacom.com). | ||
| * | ||
| * The return value keeps the shared {@link detectBotBlocker} shape (so existing | ||
| * consumers — the onboarding waitlist reason, the controller response — keep working), | ||
| * but `crawlable` is the AGGREGATE across clients (false if ANY representative client | ||
| * is blocked) and a `perClient` breakdown is added. The top-level `type`/`confidence` | ||
| * describe the blocking client so downstream messaging is accurate. | ||
| * | ||
| * NOTE: headless Chrome is intentionally NOT probed here — api-service has no browser. | ||
| * The scraper-backed headless confirmation is tracked as a follow-up; until then a | ||
| * "crawlable: true" verdict means "the lightweight HTTP clients were allowed", not | ||
| * "headless scraping will succeed". | ||
| * | ||
| * @param {Object} opts | ||
| * @param {string} opts.baseUrl - URL to check. | ||
| * @param {Object} [opts.headers] - Optional extra headers forwarded to both probes. | ||
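| * @param {Object} [deps] - Optional collaborators (second argument, destructured). | ||
| * @param {Object} [deps.log=console] - Logger. | ||
| * @param {Function} [deps.detectBotBlockerFn=detectBotBlocker] - Probe for the @adobe/fetch client. | ||
| * @param {Function} [deps.fetchFn=fetch] - fetch-compatible function for the undici probe. | ||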
| * @returns {Promise<Object>} detectBotBlocker-shaped result + `perClient`. | ||
| */ | ||
| export async function detectBotBlockerMultiClient( | ||
| { baseUrl, headers = {} } = {}, | ||
| { log = console, detectBotBlockerFn = detectBotBlocker, fetchFn = fetch } = {}, | ||
| ) { | ||
| const [adobe, undici] = await Promise.all([ | ||
| // Match probeWithUndici's behaviour: a probe failure (timeout/DNS/network) is | ||
| // inconclusive, not a block — so neither probe can reject the whole call. | ||
| Promise.resolve() | ||
| .then(() => detectBotBlockerFn({ baseUrl, headers })) | ||
| .catch((err) => { | ||
| log?.debug?.(`[bot-blocker] @adobe/fetch probe inconclusive for ${baseUrl}: ${err.message}`); | ||
| return { crawlable: true, type: 'unknown', confidence: 0.3 }; | ||
| }), | ||
| probeWithUndici(baseUrl, headers, log, fetchFn), | ||
| ]); | ||
|
|
||
| const perClient = { | ||
| 'adobe-fetch': { crawlable: adobe.crawlable, type: adobe.type, confidence: adobe.confidence }, | ||
| undici: { crawlable: undici.crawlable, type: undici.type, confidence: undici.confidence }, | ||
| }; | ||
|
|
||
| const crawlable = adobe.crawlable && undici.crawlable; | ||
|
|
||
| // Surface the blocking client's classification at the top level. Prefer the | ||
| // @adobe/fetch block (it carries allowlist IPs/UA from the shared probe); fall back | ||
| // to the undici block when @adobe/fetch was allowed but undici was not. | ||
| let blocker = adobe; | ||
| if (adobe.crawlable && !undici.crawlable) { | ||
| blocker = undici; | ||
| } | ||
|
|
||
| return { | ||
| ...adobe, | ||
|
issue (blocking): The `...adobe` spread places all of adobe's fields (including `reason`) on the return object. The conditional spread on line 115 only adds `blocker.reason` when it is truthy, so if `blocker` is undici and undici has no `reason`, the adobe `reason` from the initial spread survives unclobbered. The output then carries a `reason` from the wrong client when only undici is blocked. Scenario: adobe returns `{ crawlable: true, reason: 'informational', ... }`, undici returns `{ crawlable: false, type: 'cloudflare', reason: undefined }`. Output: `{ crawlable: false, type: 'cloudflare', reason: 'informational' }`, where the reason describes the non-blocking client. Fix: Add `reason: blocker.reason || undefined` as an explicit field (replacing the conditional spread), or delete the `...adobe` spread and explicitly list only the fields you intend to forward. |
||
| crawlable, | ||
| type: blocker.type, | ||
| confidence: blocker.confidence, | ||
| // Always reflect the blocking client's reason (overriding any reason the | ||
| // ...adobe spread carried), so the reason never describes the wrong client. | ||
| reason: blocker.reason || undefined, | ||
| perClient, | ||
| }; | ||
| } | ||
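
The return-value assembly above can be exercised in isolation. The following is a minimal sketch, not code from this PR: `mergeVerdicts` is a hypothetical stand-in that mirrors the aggregation shown in the diff, and the probe results (including the `'none'` type and the confidence values) are illustrative assumptions.

```javascript
// Hypothetical stand-in for the return-value assembly in detectBotBlockerMultiClient.
// `mergeVerdicts` is not part of the PR; it only mirrors the logic shown in the diff.
function mergeVerdicts(adobe, undici) {
  // Aggregate verdict: crawlable only if every probed client was allowed.
  const crawlable = adobe.crawlable && undici.crawlable;
  // Prefer the @adobe/fetch verdict unless undici is the only blocked client.
  const blocker = adobe.crawlable && !undici.crawlable ? undici : adobe;
  return {
    ...adobe,
    crawlable,
    type: blocker.type,
    confidence: blocker.confidence,
    // Explicit key: overrides any reason carried in by the ...adobe spread.
    reason: blocker.reason || undefined,
  };
}

// Scenario from the review comment: only undici is blocked, and it has no reason.
// The concrete values are assumptions chosen for illustration.
const merged = mergeVerdicts(
  { crawlable: true, type: 'none', confidence: 1, reason: 'informational' },
  { crawlable: false, type: 'cloudflare', confidence: 0.9 },
);

console.log(merged.crawlable, merged.type, merged.reason);
// prints: false cloudflare undefined
```

Because `reason` is listed as an explicit key after the spread, the `'informational'` reason from the allowed client no longer leaks into a verdict that describes the undici block.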
issue (blocking): The blocker-preference logic (let blocker = adobe; if adobe.crawlable and not undici.crawlable then blocker = undici) has a third branch: both blocked. When both are blocked, blocker stays as adobe (intentional per the comment about allowlist IPs). This branch has no test - the next developer who touches this conditional could break the preference without any test failing.
Fix: Add a test where both stubs return crawlable: false with different types and assert that result.type matches the adobe probe type.
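
A rough sketch of the branch coverage this comment asks for, under stated assumptions: `pickBlocker` is a hypothetical stand-in that copies the conditional from the diff, and the `'akamai'` and `'none'` types are made-up values used only to tell the two probes apart. In the real suite the same cases would presumably be driven through `detectBotBlockerMultiClient` with stubs injected via `detectBotBlockerFn` and `fetchFn`.

```javascript
// Hypothetical stand-in for the blocker-preference conditional in the diff.
function pickBlocker(adobe, undici) {
  let blocker = adobe;
  if (adobe.crawlable && !undici.crawlable) {
    blocker = undici;
  }
  return blocker;
}

// Illustrative probe results; the types and confidences are assumptions.
const adobeBlocked = { crawlable: false, type: 'cloudflare', confidence: 0.99 };
const undiciBlocked = { crawlable: false, type: 'akamai', confidence: 0.8 };
const allowed = { crawlable: true, type: 'none', confidence: 1 };

const branchCases = [
  // The untested branch: both blocked, so the adobe classification must win.
  { name: 'both blocked', adobe: adobeBlocked, undici: undiciBlocked, expected: 'cloudflare' },
  { name: 'only undici blocked', adobe: allowed, undici: undiciBlocked, expected: 'akamai' },
  { name: 'only adobe blocked', adobe: adobeBlocked, undici: allowed, expected: 'cloudflare' },
  { name: 'neither blocked', adobe: allowed, undici: allowed, expected: 'none' },
];

// Collect the names of any cases whose selected type differs from the expectation.
const branchFailures = branchCases
  .filter((c) => pickBlocker(c.adobe, c.undici).type !== c.expected)
  .map((c) => c.name);

console.log(branchFailures.length === 0 ? 'all branches ok' : `failed: ${branchFailures.join(', ')}`);
```

Giving the two blocked stubs different types is what makes the "both blocked" case meaningful: with identical types, a regression that swapped the preference would still pass.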