docs(scripts): update crawl_docs.py + modal_app.py for docs-mcp S3 pipeline by r33drichards · Pull Request #2003 · trycua/cua

r33drichards · 2026-06-23T23:14:29Z

Summary

Moves the updated docs crawler scripts from trycua/cloud (PR #5197) to trycua/cua, where they belong.

What changed

Replaces the old crawl4ai + Modal-volume based scripts with a streamlined aiohttp + BeautifulSoup crawl pipeline:

`docs/scripts/modal_app.py` (production)

- Scheduled daily Modal job (06:00 UTC) that crawls cua.ai/docs and /cua-driver
- Embeds with OpenAI text-embedding-3-small
- Writes LanceDB + SQLite FTS5 databases to s3://trycua-docs-mcp-data/docs_db/
- Synced to the K3s cluster by the docs-mcp-s3-sync CronJob

`docs/scripts/crawl_docs.py` (local dev)

- Lightweight standalone crawler for local development and one-off re-indexing
- Same URL filter logic as modal_app.py so local results match production
- Run with: `uv run docs/scripts/crawl_docs.py [--no-embed]`
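
For readers who want a feel for the shared URL filter, here is a minimal sketch. It assumes the filter is a host check plus a path-prefix allowlist covering `/docs` and `/cua-driver`. The blocklist entry and the `_under` helper are hypothetical, added only for illustration, and the real scripts may differ.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"cua.ai", "www.cua.ai"}
VALID_PATH_PREFIXES = ("/docs", "/cua-driver")
# Hypothetical blocklist entry, not taken from the PR.
BLOCKED_PATH_PREFIXES = ("/docs/private",)


def _under(path: str, prefix: str) -> bool:
    """True when path is the prefix itself or sits below it (segment-aware)."""
    return path == prefix or path.startswith(prefix + "/")


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs on an allowed host and an allowed path."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname not in ALLOWED_HOSTS:
        return False
    path = parsed.path or "/"
    if any(_under(path, p) for p in BLOCKED_PATH_PREFIXES):
        return False
    return any(_under(path, p) for p in VALID_PATH_PREFIXES)


print(is_valid_url("https://cua.ai/docs/quickstart"))
print(is_valid_url("https://example.com/docs"))
```

Matching on whole path segments keeps a path such as `/docs-old` from slipping through a bare `startswith("/docs")` check.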

Why here

The docs website rendering code lives in trycua/cloud (src/website/), but the crawl/indexing infrastructure that feeds the docs-mcp server is operational tooling that belongs in trycua/cua alongside the rest of the backend.

Closes trycua/cloud#5197

CUA-646

Summary by CodeRabbit

Chores
- Refactored documentation indexing infrastructure for improved performance and maintainability.
- Updated database schema and embedding generation for documentation search.
- Streamlined documentation crawling and processing pipeline.

vercel · 2026-06-23T23:14:34Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
docs	Ready	Preview, Comment	Jun 25, 2026 7:33am

coderabbitai · 2026-06-23T23:15:01Z

Walkthrough

Both docs/scripts/crawl_docs.py and docs/scripts/modal_app.py are fully rewritten to replace the crawl4ai/sentence-transformers stack with an aiohttp + BeautifulSoup BFS crawler, overlapping-chunk text splitting, OpenAI text-embedding-3-small embedding, and dual-database output (LanceDB docs table + SQLite FTS5 docs_fts). The Modal app is reduced from a multi-purpose MCP web server with code indexing to a single daily-scheduled crawl-and-upload job.
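
The batched BFS traversal described above can be sketched without any network access. This is an illustrative stand-in, not the code from the PR: `get_links` and `is_valid` are hypothetical callables, the `SITE` dictionary is a made-up link graph, and batches are processed sequentially here where the real scripts presumably fetch them concurrently with `aiohttp`.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], list[str]],
    is_valid: Callable[[str], bool],
    batch_size: int = 2,
) -> list[str]:
    """Breadth-first traversal with a visited set, drained in fixed-size batches."""
    visited: set[str] = set()
    queue: deque[str] = deque()
    order: list[str] = []
    for seed in seeds:
        if is_valid(seed) and seed not in visited:
            visited.add(seed)
            queue.append(seed)
    while queue:
        # Take up to batch_size URLs off the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for url in batch:
            order.append(url)
            for link in get_links(url):
                # Mark links as visited when enqueued so each URL is queued once.
                if link not in visited and is_valid(link):
                    visited.add(link)
                    queue.append(link)
    return order


# Made-up link graph standing in for fetched pages.
SITE = {
    "/docs": ["/docs/a", "/docs/b", "/blog"],
    "/docs/a": ["/docs/b", "/docs/c"],
    "/docs/b": ["/docs"],
    "/docs/c": [],
}

pages = crawl(["/docs"], lambda u: SITE.get(u, []), lambda u: u.startswith("/docs"))
print(pages)  # ['/docs', '/docs/a', '/docs/b', '/docs/c']
```

Adding a URL to the visited set at enqueue time, not at fetch time, is what prevents the cycle between `/docs` and `/docs/b` from re-queuing pages.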

Changes

Docs Crawler Pipeline Rewrite

Layer / File(s)	Summary
Configuration and constants `docs/scripts/crawl_docs.py`, `docs/scripts/modal_app.py`	Both files receive new module headers, seed URLs, `VALID_PATH_PREFIXES` allowlists, S3 bucket/prefix constants, embedding model and dimension constants, chunking parameters, and updated container image dependencies (`aiohttp`, `bs4`, `openai`, `lancedb`, `boto3`). Modal app is renamed from `cua-docs-mcp` to `docs-mcp-crawl`.
URL validation, page fetching, and BFS traversal `docs/scripts/crawl_docs.py`, `docs/scripts/modal_app.py`	`is_valid_url` enforces `cua.ai`/`www.cua.ai` host matching with path-prefix allowlist and blocklist; `fetch_page` uses `aiohttp` with content-type checks; `extract_links` and `extract_text` use BeautifulSoup to remove nav/header/footer/script/style noise; `crawl` drives a batched BFS queue with a visited set and configurable crawl delay — implemented identically in both files.
Text chunking and OpenAI embedding `docs/scripts/crawl_docs.py`, `docs/scripts/modal_app.py`	`chunk_text` splits extracted text into overlapping character segments with newline-boundary preference; `embed_chunks` (in `modal_app.py`) calls OpenAI `text-embedding-3-small` in fixed-size batches with explicit `dimensions=1536`; `crawl_docs.py` embeds inline inside `build_databases`.
LanceDB and SQLite FTS5 database construction `docs/scripts/crawl_docs.py`, `docs/scripts/modal_app.py`	`build_databases` generates deterministic chunk IDs from `url#index`, drops and recreates a LanceDB `docs` table with precomputed OpenAI vectors, and creates a SQLite FTS5 virtual table `docs_fts` with `id/url/title/text` — removing the separate non-FTS backing table present in the prior design.
CLI and Modal entrypoints `docs/scripts/crawl_docs.py`, `docs/scripts/modal_app.py`	`crawl_docs.py` CLI gains `--out-dir`, `--no-embed`, and repeatable `--seed` flags, enforcing `OPENAI_API_KEY` when embedding. `modal_app.py` removes all prior Modal functions (`crawl_docs`, `generate_vector_db`, `generate_code_index*`, `web`, etc.) and replaces them with a single `scheduled_crawl()` (daily 06:00 UTC) that runs crawl → build → recursive S3 upload, plus a local `main()` gated on `OPENAI_API_KEY`.
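
To make the database row above more concrete, here is a rough sketch of deterministic chunk IDs and a standalone FTS5 table using only the standard library. The hashing scheme in `chunk_id`, the `UNINDEXED` column options, and the sample rows are assumptions for illustration. The summary only states that IDs derive from `url#index` and that `docs_fts` has `id/url/title/text` columns. The sketch also assumes the local SQLite build includes FTS5.

```python
import hashlib
import sqlite3


def chunk_id(url: str, index: int) -> str:
    """Derive a stable ID from 'url#index' (the hashing step is an assumption)."""
    return hashlib.sha256(f"{url}#{index}".encode("utf-8")).hexdigest()[:16]


def build_fts(rows: list[tuple[str, str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table named docs_fts and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE docs_fts "
        "USING fts5(id UNINDEXED, url UNINDEXED, title, text)"
    )
    conn.executemany(
        "INSERT INTO docs_fts (id, url, title, text) VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


# Placeholder rows, not real crawl output.
url = "https://cua.ai/docs/quickstart"
rows = [
    (chunk_id(url, 0), url, "Quickstart", "placeholder chunk about sandbox setup"),
    (chunk_id(url, 1), url, "Quickstart", "placeholder chunk about agent loops"),
]
conn = build_fts(rows)
hits = conn.execute(
    "SELECT id, url FROM docs_fts WHERE docs_fts MATCH ? ORDER BY rank",
    ("sandbox",),
).fetchall()
print(hits)
```

Because the ID depends only on the URL and chunk index, re-running the crawl on unchanged pages yields the same IDs, which makes a drop-and-recreate rebuild reproducible.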

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hop, hop through every link I go,
BeautifulSoup cleans the text just so,
Chunks overlap like footprints in the snow,
OpenAI vectors make the knowledge glow,
LanceDB and FTS5 — what a show!
No more crawl4ai, just aiohttp flow,
The rabbit has indexed the docs, you know! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 63.16% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main changes: updates to crawl_docs.py and modal_app.py for the docs-mcp S3 pipeline, which aligns with the core objective of migrating documentation crawler scripts.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/scripts/crawl_docs.py`:
- Around line 211-223: The chunk_text function has an infinite loop issue where
once end reaches len(text), start is set to end - overlap (which remains less
than len(text)), causing the loop condition to stay true indefinitely while
processing the same final chunk repeatedly. To fix this, add a break statement
after the chunks.append(text[start:end].strip()) line to check if end has
reached len(text), ensuring the loop terminates once all text has been processed
and no more progress can be made.

📥 Commits

Reviewing files that changed from the base of the PR and between c898d7b and cfc7738.

📒 Files selected for processing (2)

docs/scripts/crawl_docs.py
docs/scripts/modal_app.py

coderabbitai · 2026-06-23T23:20:03Z

+def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
+    if len(text) <= chunk_size:
+        return [text]
+    chunks: list[str] = []
+    start = 0
+    while start < len(text):
+        end = min(start + chunk_size, len(text))
+        boundary = text.rfind("\n\n", start, end)
+        if boundary > start + overlap:
+            end = boundary
+        chunks.append(text[start:end].strip())
+        start = end - overlap
+    return [c for c in chunks if c]


🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

Infinite loop in chunk_text for any text longer than chunk_size.

When end reaches len(text), start is set to end - overlap == len(text) - overlap, which is always < len(text). The loop condition start < len(text) stays true and end is recomputed back to len(text) on every subsequent iteration, so start never advances and the loop never terminates. This triggers for any page exceeding CHUNK_SIZE (800 chars), which is effectively every docs page, hanging build_databases.
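
The stall is easy to see by replaying only the index arithmetic. The sketch below is not the reviewed function: it caps the loop at a few iterations, ignores the paragraph-boundary branch (as if the text had no blank lines), and assumes an overlap of 100, since only the chunk size of 800 appears in this thread.

```python
def trace_starts(text_len: int, chunk_size: int = 800, overlap: int = 100,
                 max_iters: int = 6) -> list[int]:
    """Record the value of start on each pass of the original loop, capped."""
    starts: list[int] = []
    start = 0
    while start < text_len and len(starts) < max_iters:
        starts.append(start)
        end = min(start + chunk_size, text_len)
        start = end - overlap  # once end == text_len this never changes again
    return starts


print(trace_starts(1000))  # [0, 700, 900, 900, 900, 900]
```

For a 1000-character text, `start` reaches 900 on the third pass and stays there, so without the cap the loop would keep appending the same final chunk.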

🐛 Proposed fix: terminate once the end of the text is reached

         end = min(start + chunk_size, len(text))
         boundary = text.rfind("\n\n", start, end)
         if boundary > start + overlap:
             end = boundary
         chunks.append(text[start:end].strip())
+        if end >= len(text):
+            break
         start = end - overlap
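
For reference, the function with the proposed fix applied can be run standalone as below. `CHUNK_SIZE = 800` comes from the comment above. `CHUNK_OVERLAP = 100` is an assumed value, since the actual constant is not shown in this thread.

```python
CHUNK_SIZE = 800      # stated in the review comment
CHUNK_OVERLAP = 100   # assumed value for illustration


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks, preferring blank-line boundaries."""
    if len(text) <= chunk_size:
        return [text]
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        boundary = text.rfind("\n\n", start, end)
        if boundary > start + overlap:
            end = boundary
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # the final chunk has been emitted, so stop
        start = end - overlap
    return [c for c in chunks if c]


print([len(c) for c in chunk_text("x" * 2000)])  # [800, 800, 600]
```

The loop still makes progress on every pass as long as `chunk_size` is greater than `overlap`, because each new `start` is strictly larger than the previous one.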


- Add break when end >= len(text) to prevent infinite loop when text ends exactly on a chunk boundary (both crawl_docs.py and modal_app.py)
- Add docstrings to all functions in crawl_docs.py to pass the 80% docstring coverage threshold flagged by CodeRabbit

Addresses CodeRabbit review on PR #2003

Replaces the old crawl4ai/Modal-volume-based scripts with updated lightweight aiohttp+BeautifulSoup crawlers that:
- Use OpenAI text-embedding-3-small for embeddings
- Write LanceDB + SQLite FTS5 databases
- Upload to s3://trycua-docs-mcp-data/docs_db/ (modal_app.py)
- Support local one-off re-indexing without Modal (crawl_docs.py)
- Include /cua-driver product page in crawl scope

These scripts were previously contributed to trycua/cloud PR #5197 but belong in trycua/cua since they are the production crawl infrastructure.

CUA-646
vercel Bot deployed to Preview June 23, 2026 23:15 View deployment

coderabbitai Bot reviewed Jun 23, 2026

vercel Bot deployed to Preview June 25, 2026 07:13 View deployment

cuaclaw added 2 commits June 25, 2026 07:31

r33drichards force-pushed the cua-646 branch from 3772618 to ff16983 Compare June 25, 2026 07:31

vercel Bot deployed to Preview June 25, 2026 07:33 View deployment

r33drichards wants to merge 2 commits into main from cua-646