25 commits
b21396a
feat(web): add web plugin with browser automation and page analysis s…
catalan-adobe May 30, 2026
e97c742
refactor(web): commit to playwright-cli as single browser layer
catalan-adobe Jun 2, 2026
9440ec0
fix(web): fix eval assertions and stripEnvelope bug
catalan-adobe Jun 2, 2026
7b0ec67
fix(web): fix four HIGH review findings
catalan-adobe Jun 2, 2026
c6f0a5a
fix(web): fix MEDIUM review findings
catalan-adobe Jun 2, 2026
daec9af
docs(domain-mask): remove internal proxy mechanics section
catalan-adobe Jun 2, 2026
7006e1d
docs(browser-probe): condense Step 3 into table, trim consumer section
catalan-adobe Jun 2, 2026
b91087c
docs(browser-probe): remove duplicate signal table, tighten mapping t…
catalan-adobe Jun 2, 2026
8968388
docs(cdp-ext-pilot): split tips into troubleshooting reference
catalan-adobe Jun 2, 2026
3869a7c
docs(page-prep): trim ~50 lines via reference extraction
catalan-adobe Jun 2, 2026
a8e6265
docs(page-prep): remove repeated cmp-match/heuristic explanations and…
catalan-adobe Jun 2, 2026
8a83e96
docs(page-prep): move format schemas to references/formats.md
catalan-adobe Jun 2, 2026
1721e4b
docs(page-prep): trim explanatory rationale passages
catalan-adobe Jun 2, 2026
f4f1b18
docs(page-prep): remove narration, IIFE explanation, restated mode info
catalan-adobe Jun 2, 2026
9bdf86a
docs(reduce-page): drop why-pattern block, compress Phase 1 JSON, rem…
catalan-adobe Jun 2, 2026
05cfa64
docs(visual-tree): remove Pipeline section and redundant tip
catalan-adobe Jun 2, 2026
e4c8b20
refactor(web): rename reduce-page → page-reduce, visual-tree → page-tree
catalan-adobe Jun 2, 2026
3687651
fix(browser-probe): resolve symlinks in isMain check, fallback for mi…
catalan-adobe Jun 4, 2026
1e20661
fix(cdp-ext-pilot): fall back to tab mode when no content script cont…
catalan-adobe Jun 4, 2026
7202e67
fix(page-prep): correct playwright-cli screenshot syntax in Step 9b
catalan-adobe Jun 4, 2026
d4aa441
fix(page-prep): exclude off-screen elements from DOM residual check
catalan-adobe Jun 4, 2026
14793cd
fix(page-collect): write tmp files inside output dir, not /tmp/
catalan-adobe Jun 4, 2026
d3330a8
docs(web): add contributor docs for playwright-cli constraints and lo…
catalan-adobe Jun 4, 2026
08a2324
feat(web): add page-langs skill for webpage language detection
catalan-adobe Jun 8, 2026
9897b94
docs(page-langs): add validation checkpoint and error-recovery guidance
catalan-adobe Jun 8, 2026
14 changes: 14 additions & 0 deletions .claude-plugin/marketplace.json
@@ -48,6 +48,20 @@
"repository": "https://github.com/adobe/skills",
"license": "Apache-2.0"
},
{
"name": "web",
"source": "./plugins/web",
"description": "Browser automation and web page analysis skills: detect the browser layer, connect via CDP, probe bot protection, dismiss overlays, capture DOM trees, reduce pages to skeletons, extract page resources.",
"version": "1.0.0",
"category": "web",
"keywords": ["browser", "playwright", "cdp", "web-scraping", "page-analysis", "automation"],
"author": {
"name": "Adobe"
},
"homepage": "https://github.com/adobe/skills",
"repository": "https://github.com/adobe/skills",
"license": "Apache-2.0"
},
{
"name": "aem-edge-delivery-services",
"source": "./plugins/aem/edge-delivery-services",
3 changes: 3 additions & 0 deletions .github/CODEOWNERS
@@ -38,3 +38,6 @@

# Stardust
/plugins/stardust @paolomoz

# Web (browser automation and page analysis)
/plugins/web @catalan-adobe
11 changes: 11 additions & 0 deletions plugins/web/.claude-plugin/plugin.json
@@ -0,0 +1,11 @@
{
  "name": "web",
  "description": "Browser automation and web page analysis skills: detect the available browser layer, connect via CDP, probe CDN bot protection, dismiss overlays, capture spatial DOM trees, reduce pages to skeletons, and extract structured page resources.",
  "version": "1.0.0",
  "author": {
    "name": "Adobe"
  },
  "repository": "https://github.com/adobe/skills",
  "license": "Apache-2.0",
  "keywords": ["browser", "playwright", "cdp", "web-scraping", "page-analysis", "automation"]
}
1 change: 1 addition & 0 deletions plugins/web/skills/browser-probe/.releaserc.json
@@ -0,0 +1 @@
{"extends": "../../../../../release.config.cjs"}
160 changes: 160 additions & 0 deletions plugins/web/skills/browser-probe/SKILL.md
@@ -0,0 +1,160 @@
---
name: browser-probe
license: Apache-2.0
description: >-
Probe a URL with escalating headless browser configurations to detect CDN bot
protection (Akamai, Cloudflare, DataDome, AWS WAF) and produce a
browser-recipe.json that downstream playwright-cli consumers use to bypass
blocking. Runs an automated escalation ladder: default headless → stealth
script injection → system Chrome (TLS fingerprint fix) → persistent profile.
Use BEFORE any playwright-cli interaction with an untrusted domain. Triggers
on: browser probe, site blocked, headless blocked, CDN blocking, bot
detection, browser recipe, can't load page, 403 error page, access denied.
---

# Browser Probe

Detect CDN bot protection blocking headless Chrome and produce a browser recipe
for downstream `playwright-cli` consumers. Node 22+ required. No npm
dependencies.

## When to Use

Run this skill **before** any `playwright-cli` interaction with a domain you
haven't tested, or when a downstream script reports a blocked page. Common
triggers:

- First interaction with a new domain
- `capture-snapshot.js` produces empty/error snapshots
- Page title contains "error", "denied", "blocked", "captcha"
- HTTP 403 responses from headless browser

## Script Location

```bash
if [[ -n "${CLAUDE_SKILL_DIR:-}" ]]; then
  PROBE_DIR="${CLAUDE_SKILL_DIR}/scripts"
else
  PROBE_DIR="$(dirname "$(command -v browser-probe.js 2>/dev/null || \
    find ~/.claude -path "*/browser-probe/scripts/browser-probe.js" \
      -type f 2>/dev/null | head -1)")"
fi
```

## Workflow

### Step 1 — Run the probe

```bash
node "$PROBE_DIR/browser-probe.js" "$URL" "$OUTPUT_DIR"
```

The script tries up to 5 browser configurations, stopping at the first success:

1. **default** — headless Chromium (baseline)
2. **stealth** — headless Chromium + JS stealth init script (patches `navigator.webdriver`, plugins, languages)
3. **stealth-ua** — headless Chromium + JS stealth + User-Agent override (removes `HeadlessChrome` from HTTP UA header via `--user-agent` launch arg)
4. **chrome** — system Chrome (`--browser=chrome`) + JS stealth + UA override (fixes TLS fingerprint detection)
5. **persistent** — system Chrome + JS stealth + UA override + persistent profile (cookie/session challenges)

Output: `$OUTPUT_DIR/probe-report.json`
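
The stop-at-first-success behaviour of the ladder can be sketched as follows. This is a minimal illustration, not the actual `browser-probe.js` code: `runLadder`, the caller-supplied `attempt` callback, and the `tried` field are hypothetical names, and a real probe would be asynchronous.

```js
// Step names in escalation order, as listed above.
const LADDER = ['default', 'stealth', 'stealth-ua', 'chrome', 'persistent'];

// Hypothetical helper: `attempt(step)` is supplied by the caller and
// returns { ok: boolean, signals?: string[] } for one configuration.
function runLadder(attempt, steps = LADDER) {
  const tried = [];
  const signals = new Set();
  for (const step of steps) {
    tried.push(step);
    const result = attempt(step);
    // Accumulate signals across attempts, without duplicates.
    for (const s of result.signals ?? []) signals.add(s);
    if (result.ok) {
      // Stop escalating as soon as one configuration loads the page.
      return { firstSuccess: step, tried, detectedSignals: [...signals] };
    }
  }
  // Every configuration failed.
  return { firstSuccess: null, tried, detectedSignals: [...signals] };
}
```

Because the loop returns early, later (heavier) configurations such as `persistent` are only attempted when every lighter one has failed.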

### Step 2 — Read the report

Load `probe-report.json`. Check `firstSuccess`:
- If non-null: a configuration worked. Proceed to Step 3.
- If null: all configurations failed. Skip to Step 5.
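
The branch above can be sketched as a small function over the parsed report. `nextStep` is a hypothetical name; only the `firstSuccess` field comes from the report described here.

```js
// Hypothetical helper: decide which workflow step follows Step 2.
// `report` is the parsed content of probe-report.json.
function nextStep(report) {
  if (report === null || typeof report !== 'object' || !('firstSuccess' in report)) {
    throw new Error('probe-report.json has no firstSuccess field');
  }
  // Non-null: a configuration worked, interpret it (Step 3).
  // Null: everything failed, go straight to reporting (Step 5).
  return report.firstSuccess === null ? 5 : 3;
}
```

Rejecting a report without `firstSuccess` avoids mistaking a truncated or malformed file for "all configurations failed".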

### Step 3 — Interpret results

Load the stealth configuration reference at `references/stealth-config.md` and match the
`detectedSignals` array against the Provider Signature Table.

Key interpretation rules:
- `cloudfront-block` is present, or `stealth` fails but `stealth-ua` succeeds →
CloudFront WAF UA-based blocking (it matches `HeadlessChrome` in the HTTP
User-Agent header). Common on pharma/enterprise sites. This is a simple fix
with no TLS concerns: `stealth-ua` is the minimum working config.
- `cloudfront` without `cloudfront-block` → CloudFront present but not
actively blocking. Default config may work.
- `akamai-server` or `akamai-bot-manager` → TLS fingerprint blocking.
System Chrome is the fix. Stealth + UA alone is insufficient.
- `cloudflare-ray` without `cloudflare-challenge` → Cloudflare present
but not actively blocking. Default config may work.
- `cloudflare-challenge` → Active JS challenge. System Chrome + stealth
+ UA usually resolves it.
- `datadome` → Aggressive detection. System Chrome + stealth + UA required.
- `aws-waf` → Usually UA-based. Stealth + UA often sufficient.
- No signals + blocked → Unknown protection. Persistent profile is last
resort.
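
As a rough illustration, the rules above could be encoded like this. `interpret` is a hypothetical helper, and treating "blocked" as "`firstSuccess` is anything other than `default`" is an assumption made for the sketch.

```js
// Hypothetical helper: turn a probe report into human-readable findings,
// following the interpretation rules listed above.
function interpret(report) {
  const signals = new Set(report.detectedSignals ?? []);
  const has = (name) => signals.has(name);
  const notes = [];

  if (has('cloudfront-block') || report.firstSuccess === 'stealth-ua') {
    notes.push('CloudFront WAF UA-based blocking; stealth-ua is the minimum working config');
  } else if (has('cloudfront')) {
    notes.push('CloudFront present but not actively blocking; default config may work');
  }
  if (has('akamai-server') || has('akamai-bot-manager')) {
    notes.push('Akamai TLS fingerprint blocking; system Chrome is the fix');
  }
  if (has('cloudflare-challenge')) {
    notes.push('Cloudflare active JS challenge; system Chrome + stealth + UA usually resolves it');
  } else if (has('cloudflare-ray')) {
    notes.push('Cloudflare present but not actively blocking; default config may work');
  }
  if (has('datadome')) {
    notes.push('DataDome aggressive detection; system Chrome + stealth + UA required');
  }
  if (has('aws-waf')) {
    notes.push('AWS WAF, usually UA-based; stealth + UA often sufficient');
  }
  // Assumption: "blocked" means the default configuration did not succeed.
  if (signals.size === 0 && notes.length === 0 && report.firstSuccess !== 'default') {
    notes.push('Unknown protection; persistent profile is the last resort');
  }
  return notes;
}
```

The returned strings are a natural input for the `notes` field of the recipe.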

### Step 4 — Generate recipe

Write `browser-recipe.json` to `$OUTPUT_DIR`:

```json
{
  "url": "<probed URL>",
  "generated": "<ISO timestamp>",
  "cliConfig": {
    "browser": {
      "browserName": "chromium",
      "launchOptions": { "channel": "<from firstSuccess step>" }
    }
  },
  "stealthInitScript": "<full script from stealth-config.md if stealth was needed>",
  "notes": "<1-2 sentence explanation of what was detected and why this config>"
}
```

**Config mapping from `firstSuccess`:**

| firstSuccess | cliConfig.browser.launchOptions | stealthInitScript |
|---|---|---|
| `default` | `{}` (no channel, no args) | `null` (not needed) |
| `stealth` | `{}` (no channel, no args) | Full stealth script from reference |
| `stealth-ua` | `{ "args": ["--user-agent=<realistic UA>"] }` | Full stealth script from reference |
| `chrome` | `{ "channel": "chrome", "args": ["--user-agent=<realistic UA>"] }` | Full stealth script from reference |
| `persistent` | `{ "channel": "chrome", "args": ["--user-agent=<realistic UA>"] }` | Full stealth script from reference |

If `firstSuccess` is `persistent`, add a `"persistent": true` field to the
recipe so consumers know to use `--persistent`.
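
The mapping table can be sketched as one function. `buildRecipe` and its parameter names are hypothetical; the field names and per-step values follow the template and table above.

```js
// Ladder order matters: each step adds to the previous one.
const KNOWN_STEPS = ['default', 'stealth', 'stealth-ua', 'chrome', 'persistent'];

// Hypothetical helper: build the browser-recipe.json content.
// `stealthScript` is the full script from stealth-config.md,
// `userAgent` is the realistic UA string to send.
function buildRecipe({ url, firstSuccess, stealthScript, userAgent, notes = '' }) {
  const rank = KNOWN_STEPS.indexOf(firstSuccess);
  if (rank === -1) {
    // Covers null: no recipe is produced when all steps failed.
    throw new Error(`No recipe for firstSuccess=${firstSuccess}`);
  }
  const launchOptions = {};
  if (rank >= 3) launchOptions.channel = 'chrome';                   // chrome, persistent
  if (rank >= 2) launchOptions.args = [`--user-agent=${userAgent}`]; // stealth-ua and up
  const recipe = {
    url,
    generated: new Date().toISOString(),
    cliConfig: { browser: { browserName: 'chromium', launchOptions } },
    stealthInitScript: rank >= 1 ? stealthScript : null,             // everything but default
    notes,
  };
  if (firstSuccess === 'persistent') recipe.persistent = true;
  return recipe;
}
```

Serializing the result with `JSON.stringify(recipe, null, 2)` gives the file content to write to `$OUTPUT_DIR/browser-recipe.json`.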

### Step 5 — Report results

**If a configuration worked:**
```
Browser probe complete for <url>.
Working config: <firstSuccess>
Detected: <detectedSignals or "no bot protection detected">
Recipe: <path to browser-recipe.json>
```

**If all configurations failed:**
```
Browser probe failed for <url>. No headless configuration could load the page.
Tried: default, stealth, stealth-ua, chrome, persistent
Detected signals: <detectedSignals>

Options:
1. Use --headed flag for manual browser interaction
2. Provide pre-captured data (DOM snapshot, screenshots) manually
3. Check if the URL requires authentication or VPN access
```

Do NOT produce a recipe when all steps fail. Do NOT silently continue
with a broken configuration.

## How Consumers Use the Recipe

Any script using `playwright-cli` can consume `browser-recipe.json`:

1. Write `cliConfig` to a temp file (e.g., `/tmp/probe-cli-config.json`)
2. If recipe has `stealthInitScript`, write it to a temp file and add
it to the config's `browser.initScript` array (do NOT use
`playwright-cli eval` — eval only accepts pure expressions, not
multi-statement scripts)
3. Pass `--config=/tmp/probe-cli-config.json` to `playwright-cli open`
4. Proceed with normal `goto <url>` and workflow

If recipe has `"persistent": true`, also pass `--persistent` to `open`.
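
A consumer could prepare the launch along these lines. This is a sketch under assumptions: `planLaunch` and the `probe-stealth-init.js` file name are hypothetical, and the function only returns what to write and which arguments to pass, leaving file I/O and process spawning to the caller.

```js
// Hypothetical helper: derive temp files and `playwright-cli open` arguments
// from a parsed browser-recipe.json.
function planLaunch(recipe, tmpDir = '/tmp') {
  const configPath = `${tmpDir}/probe-cli-config.json`;
  const initScriptPath = `${tmpDir}/probe-stealth-init.js`; // assumed file name
  // Deep copy so the caller's recipe object is not mutated.
  const config = JSON.parse(JSON.stringify(recipe.cliConfig));
  config.browser = config.browser ?? {};
  const files = {};
  if (recipe.stealthInitScript) {
    files[initScriptPath] = recipe.stealthInitScript;
    // The script path goes into browser.initScript, never through `eval`.
    config.browser.initScript = [...(config.browser.initScript ?? []), initScriptPath];
  }
  files[configPath] = JSON.stringify(config, null, 2);
  const args = ['open', `--config=${configPath}`];
  if (recipe.persistent === true) args.push('--persistent');
  return { files, args };
}
```

The caller would write each entry of `files` to disk, then invoke `playwright-cli` with `args` (plus a session flag, if one is used) before the usual `goto <url>`.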
18 changes: 18 additions & 0 deletions plugins/web/skills/browser-probe/evals/evals.json
@@ -0,0 +1,18 @@
{
  "skill_name": "browser-probe",
  "evals": [
    {
      "id": 1,
      "prompt": "Check if https://example.com has bot protection and get a browser recipe for it",
      "expected_output": "A browser-recipe.json is generated showing the detected protection level and recommended configuration.",
      "files": [],
      "assertions": [
        {
          "type": "command_succeeds",
          "command": "node -e \"require('./scripts/browser-probe.js')\"",
          "description": "Browser probe script loads without syntax errors."
        }
      ]
    }
  ]
}
1 change: 1 addition & 0 deletions plugins/web/skills/browser-probe/package.json
@@ -0,0 +1 @@
{ "name": "browser-probe", "version": "0.0.0-semantically-released", "private": true }
98 changes: 98 additions & 0 deletions plugins/web/skills/browser-probe/references/stealth-config.md
@@ -0,0 +1,98 @@
# Stealth Configuration Reference

## Stealth Init Script

Inject via `initScript` in the playwright-cli config (NOT via `eval` —
eval only accepts pure expressions, not multi-statement scripts). Write
this script to a temp file and add the path to `browser.initScript` in
the config. It runs before any page JS loads, patching browser
fingerprints that headless detection relies on.

```js
(function() {
  // Hide webdriver property (primary headless signal)
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

  // Add realistic plugins (headless Chrome has empty plugins array)
  Object.defineProperty(navigator, 'plugins', {
    get: () => [
      { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format' },
      { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: '' },
      { name: 'Native Client', filename: 'internal-nacl-plugin', description: '' },
    ],
  });

  // Set realistic languages (headless may report empty)
  Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });

  // Add chrome runtime object (missing in headless)
  window.chrome = { runtime: {} };
})()
```

## User-Agent Override

Chromium's headless mode injects `HeadlessChrome` into the HTTP User-Agent
header. Many WAFs (especially CloudFront) use simple string matching on this
token as a first-pass bot filter. This is an HTTP-level signal — JS stealth
patches cannot change it.

Fix: pass a realistic UA via Chrome launch arg in a `playwright-cli` config file:

```json
{
  "browser": {
    "browserName": "chromium",
    "launchOptions": {
      "args": ["--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"]
    }
  }
}
```

Usage: `playwright-cli -s=<session> open --config=<path-to-config>`
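
One way to obtain a realistic UA, sketched here as an assumption rather than what the probe script does, is to take the UA the headless browser reports and strip the `HeadlessChrome` token. `toRealisticUa` and `uaLaunchArg` are hypothetical helpers.

```js
// Hypothetical helper: rewrite a headless UA so it no longer carries
// the `HeadlessChrome` token that simple WAF string matching looks for.
function toRealisticUa(headlessUa) {
  return headlessUa.replace('HeadlessChrome/', 'Chrome/');
}

// Build the Chrome launch argument used in `launchOptions.args`.
function uaLaunchArg(headlessUa) {
  return `--user-agent=${toRealisticUa(headlessUa)}`;
}
```

This keeps the version and platform consistent with the browser that is actually running, which a hard-coded UA string cannot guarantee.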

## Stealth HTTP Headers

These headers mimic a real Chrome session. They cannot currently be injected
via `playwright-cli` (it has no `extraHTTPHeaders` support). They are documented
for future use and for scripts that use the Playwright API directly.

| Header | Value |
|--------|-------|
| `Accept` | `text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8` |
| `Accept-Language` | `en-US,en;q=0.9` |
| `Accept-Encoding` | `gzip, deflate, br` |
| `Cache-Control` | `no-cache` |
| `Sec-Ch-Ua` | `"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"` |
| `Sec-Ch-Ua-Mobile` | `?0` |
| `Sec-Ch-Ua-Platform` | `"macOS"` |
| `Sec-Fetch-Dest` | `document` |
| `Sec-Fetch-Mode` | `navigate` |
| `Sec-Fetch-Site` | `none` |
| `Sec-Fetch-User` | `?1` |
| `Upgrade-Insecure-Requests` | `1` |

## Provider Signature Table

Maps observable signals (from `playwright-cli network` response headers and
page content) to CDN bot detection providers and typical remedies.

| Signal | Provider | Confidence | Typical fix |
|--------|----------|------------|-------------|
| `server: AkamaiGHost` or `server: AkamaiNetStorage` | Akamai | medium | System Chrome (`--browser=chrome`) — TLS fingerprint |
| `bm_sz` cookie in `set-cookie` | Akamai Bot Manager | high | System Chrome — TLS fingerprint |
| `_abck` cookie in `set-cookie` | Akamai Bot Manager | high | System Chrome — TLS fingerprint |
| `stealth` blocked + `stealth-ua` succeeds (no provider headers) | CloudFront UA filter | high | UA override (`--user-agent` launch arg) |
| `cf-ray` header present | Cloudflare | medium | Stealth script often sufficient |
| Page title contains "Just a moment" or "Checking your browser" | Cloudflare Challenge | high | System Chrome + stealth |
| `x-datadome` header present | DataDome | high | System Chrome + stealth |
| `x-amzn-waf-action` header present | AWS WAF | medium | Stealth script + UA override (detection is usually UA-based) |
| `x-cdn: Imperva` or `x-iinfo` header | Incapsula/Imperva | medium | System Chrome + stealth |
| Page title contains "Access Denied" + `server: AkamaiGHost` | Akamai hard block | high | System Chrome — TLS fingerprint |
| `server: CloudFront` or `x-amz-cf-id` header | CloudFront | medium | Stealth script + UA override (blocking is often UA-based) |
| Page title contains "The request could not be satisfied" | CloudFront WAF block | high | UA override or stealth script |
| `stealth` (JS-only) succeeds, `default` blocked | JS fingerprint detection | high | Stealth script sufficient |
| `stealth` fails but `stealth-ua` succeeds | HTTP UA-based blocking | high | UA override (`--user-agent` launch arg) |
| Page title matches `/error\|denied\|blocked\|403\|captcha/i` + no known provider | Unknown WAF | low | Escalate to persistent profile |
| `status: 403` + `bodyLength < 500` | Generic block | low | Escalate through all steps |
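
A matcher for part of this table might look like the sketch below. `detectSignals` is a hypothetical helper, and pairing each table row with a signal name from the SKILL.md interpretation rules (`akamai-server`, `cloudfront-block`, and so on) is an assumption; rows without a named signal there, such as Imperva, are left out.

```js
// Hypothetical helper: derive signal names from response headers and page title.
function detectSignals({ headers = {}, title = '' } = {}) {
  // Header names are case-insensitive, so normalize them first.
  const h = {};
  for (const [name, value] of Object.entries(headers)) {
    h[name.toLowerCase()] = String(value);
  }
  const server = (h['server'] ?? '').toLowerCase();
  const cookies = h['set-cookie'] ?? '';
  const signals = [];

  if (server === 'akamaighost' || server === 'akamainetstorage') signals.push('akamai-server');
  if (/(^|[\s;,])(bm_sz|_abck)=/.test(cookies)) signals.push('akamai-bot-manager');
  if ('cf-ray' in h) signals.push('cloudflare-ray');
  if (/just a moment|checking your browser/i.test(title)) signals.push('cloudflare-challenge');
  if ('x-datadome' in h) signals.push('datadome');
  if ('x-amzn-waf-action' in h) signals.push('aws-waf');
  if (server === 'cloudfront' || 'x-amz-cf-id' in h) signals.push('cloudfront');
  if (/the request could not be satisfied/i.test(title)) signals.push('cloudfront-block');
  return signals;
}
```

Keeping presence signals (`cloudfront`, `cloudflare-ray`) separate from blocking signals (`cloudfront-block`, `cloudflare-challenge`) preserves the distinction the interpretation rules rely on.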