docs(scripts): update crawl_docs.py + modal_app.py for docs-mcp S3 pipeline #2003

Open
r33drichards wants to merge 2 commits into main from cua-646

r33drichards commented Jun 23, 2026

Collaborator

Summary

Moves the updated docs crawler scripts from trycua/cloud (PR #5197) to trycua/cua, where they belong.

What changed

Replaces the old crawl4ai + Modal-volume-based scripts with a streamlined aiohttp + BeautifulSoup crawl pipeline:

docs/scripts/modal_app.py (production)

  • Scheduled daily Modal job (06:00 UTC) that crawls cua.ai/docs and /cua-driver
  • Embeds with OpenAI text-embedding-3-small
  • Writes LanceDB + SQLite FTS5 databases to s3://trycua-docs-mcp-data/docs_db/
  • Synced to the K3s cluster by the docs-mcp-s3-sync CronJob
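The S3 write step above implies walking the local output directory and mapping each file to a key under the docs_db/ prefix. The following is a minimal sketch of that mapping only, not the code in modal_app.py: the function names and the injected `upload_file` callable are hypothetical, and only the bucket and prefix come from the description.

```python
from pathlib import Path
from typing import Callable, Iterator

# Bucket and prefix as named in the PR description.
S3_BUCKET = "trycua-docs-mcp-data"
S3_PREFIX = "docs_db"


def iter_upload_targets(local_dir: Path, prefix: str = S3_PREFIX) -> Iterator[tuple[Path, str]]:
    """Yield (local_file, s3_key) for every regular file under local_dir."""
    for path in sorted(local_dir.rglob("*")):
        if path.is_file():
            # as_posix() keeps forward slashes in the key on every platform.
            rel = path.relative_to(local_dir).as_posix()
            yield path, f"{prefix}/{rel}"


def upload_tree(local_dir: Path, upload_file: Callable[[str, str, str], None]) -> int:
    """Call upload_file(filename, bucket, key) for each file; return the file count.

    upload_file is injected (hypothetical design) so the walk can be checked
    without network access.
    """
    count = 0
    for path, key in iter_upload_targets(local_dir):
        upload_file(str(path), S3_BUCKET, key)
        count += 1
    return count
```

With boto3, `boto3.client("s3").upload_file` takes the same three positional arguments (filename, bucket, key), so it could be passed in directly. The walk is recursive because the output may contain nested directories as well as single files.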

docs/scripts/crawl_docs.py (local dev)

  • Lightweight standalone crawler for local development and one-off re-indexing
  • Same URL filter logic as modal_app.py so local results match production
  • Run with: uv run docs/scripts/crawl_docs.py [--no-embed]
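To make the shared URL filter concrete, here is a rough sketch of a host and path-prefix check driving a breadth-first traversal. It runs over an in-memory link graph rather than the network, the blocklist entries are hypothetical, and the whole-segment prefix match is an assumption; the actual is_valid_url and crawl in the scripts may differ.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

ALLOWED_HOSTS = {"cua.ai", "www.cua.ai"}
VALID_PATH_PREFIXES = ("/docs", "/cua-driver")  # crawl scope named in the PR
BLOCKED_PATH_PARTS = ("/api/", "/_next/")       # hypothetical blocklist


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs on an allowed host under an allowed path prefix."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname not in ALLOWED_HOSTS:
        return False
    path = parsed.path or "/"
    if any(part in path for part in BLOCKED_PATH_PARTS):
        return False
    # Match whole path segments so "/docsfoo" does not pass as "/docs".
    return any(path == p or path.startswith(p + "/") for p in VALID_PATH_PREFIXES)


def crawl_order(seeds: list[str], links: dict[str, list[str]]) -> list[str]:
    """Breadth-first visit order over `links` (page URL -> outgoing URLs)."""
    queue = deque(dict.fromkeys(s for s in seeds if is_valid_url(s)))
    visited = set(queue)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in links.get(url, []):
            link, _fragment = urldefrag(link)  # "#section" anchors are the same page
            if link not in visited and is_valid_url(link):
                visited.add(link)
                queue.append(link)
    return order
```

Keeping the filter in one pure function is what lets a local run and the scheduled job agree on the set of pages, since both can be checked against the same inputs.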

Why here

The docs website rendering code lives in trycua/cloud (src/website/), but the crawl/indexing infrastructure that feeds the docs-mcp server is operational tooling that belongs in trycua/cua alongside the rest of the backend.

Closes trycua/cloud#5197

CUA-646

Summary by CodeRabbit

  • Chores
    • Refactored documentation indexing infrastructure for improved performance and maintainability.
    • Updated database schema and embedding generation for documentation search.
    • Streamlined documentation crawling and processing pipeline.

vercel Bot commented Jun 23, 2026

Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
docs | Ready | Preview, Comment | Jun 25, 2026 7:33am

coderabbitai Bot commented Jun 23, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a2b28d7-206d-422d-8096-0b88b03074cb

Walkthrough

Both docs/scripts/crawl_docs.py and docs/scripts/modal_app.py are fully rewritten to replace the crawl4ai/sentence-transformers stack with an aiohttp + BeautifulSoup BFS crawler, overlapping-chunk text splitting, OpenAI text-embedding-3-small embedding, and dual-database output (LanceDB docs table + SQLite FTS5 docs_fts). The Modal app is reduced from a multi-purpose MCP web server with code indexing to a single daily-scheduled crawl-and-upload job.
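As an illustration of the SQLite half of the dual-database output, the sketch below creates an FTS5 table named docs_fts with id/url/title/text columns and fills it with chunk rows. The literal `url#index` ID format, the UNINDEXED columns, and the function names are assumptions made for this example, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3


def build_fts(conn: sqlite3.Connection, pages) -> int:
    """Rebuild docs_fts from `pages`, an iterable of (url, title, chunks)."""
    conn.execute("DROP TABLE IF EXISTS docs_fts")
    # Assumption: id and url are stored but not tokenized, so only title and
    # text take part in full-text matching.
    conn.execute(
        "CREATE VIRTUAL TABLE docs_fts USING fts5(id UNINDEXED, url UNINDEXED, title, text)"
    )
    rows = [
        (f"{url}#{i}", url, title, chunk)  # same URL and position always give the same ID
        for url, title, chunks in pages
        for i, chunk in enumerate(chunks)
    ]
    conn.executemany(
        "INSERT INTO docs_fts (id, url, title, text) VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


def search(conn: sqlite3.Connection, query: str, limit: int = 5):
    """Return (id, title) for the best FTS5 matches, best first."""
    cur = conn.execute(
        "SELECT id, title FROM docs_fts WHERE docs_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()
```

Because the IDs depend only on the URL and the chunk position, a re-crawl of unchanged pages yields the same IDs, which keeps the keyword index and the vector table easy to join.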

Changes

Docs Crawler Pipeline Rewrite

Layer / File(s) Summary
Configuration and constants
docs/scripts/crawl_docs.py, docs/scripts/modal_app.py
Both files receive new module headers, seed URLs, VALID_PATH_PREFIXES allowlists, S3 bucket/prefix constants, embedding model and dimension constants, chunking parameters, and updated container image dependencies (aiohttp, bs4, openai, lancedb, boto3). Modal app is renamed from cua-docs-mcp to docs-mcp-crawl.
URL validation, page fetching, and BFS traversal
docs/scripts/crawl_docs.py, docs/scripts/modal_app.py
is_valid_url enforces cua.ai/www.cua.ai host matching with path-prefix allowlist and blocklist; fetch_page uses aiohttp with content-type checks; extract_links and extract_text use BeautifulSoup to remove nav/header/footer/script/style noise; crawl drives a batched BFS queue with a visited set and configurable crawl delay — implemented identically in both files.
Text chunking and OpenAI embedding
docs/scripts/crawl_docs.py, docs/scripts/modal_app.py
chunk_text splits extracted text into overlapping character segments with newline-boundary preference; embed_chunks (in modal_app.py) calls OpenAI text-embedding-3-small in fixed-size batches with explicit dimensions=1536; crawl_docs.py embeds inline inside build_databases.
LanceDB and SQLite FTS5 database construction
docs/scripts/crawl_docs.py, docs/scripts/modal_app.py
build_databases generates deterministic chunk IDs from url#index, drops and recreates a LanceDB docs table with precomputed OpenAI vectors, and creates a SQLite FTS5 virtual table docs_fts with id/url/title/text — removing the separate non-FTS backing table present in the prior design.
CLI and Modal entrypoints
docs/scripts/crawl_docs.py, docs/scripts/modal_app.py
crawl_docs.py CLI gains --out-dir, --no-embed, and repeatable --seed flags, enforcing OPENAI_API_KEY when embedding. modal_app.py removes all prior Modal functions (crawl_docs, generate_vector_db, generate_code_index*, web, etc.) and replaces them with a single scheduled_crawl() (daily 06:00 UTC) that runs crawl → build → recursive S3 upload, plus a local main() gated on OPENAI_API_KEY.
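The CLI behaviour in the last row can be sketched with argparse. This is an illustrative reconstruction rather than the script's actual parser: the default output directory, the help strings, the default seed list (inferred from the crawl scope in the description), and the `env` parameter are all assumptions.

```python
import argparse
import os

# Assumed defaults, inferred from the crawl scope named in the PR description.
DEFAULT_SEEDS = ["https://cua.ai/docs", "https://cua.ai/cua-driver"]


def parse_args(argv=None, env=None) -> argparse.Namespace:
    """Parse crawler flags and require OPENAI_API_KEY unless --no-embed is set."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Crawl docs and build search databases")
    parser.add_argument("--out-dir", default="docs_db",  # hypothetical default
                        help="directory for the generated databases")
    parser.add_argument("--no-embed", action="store_true",
                        help="skip embedding generation")
    parser.add_argument("--seed", action="append", default=None,
                        help="seed URL; may be given more than once")
    args = parser.parse_args(argv)
    args.seed = args.seed or list(DEFAULT_SEEDS)
    if not args.no_embed and not env.get("OPENAI_API_KEY"):
        # parser.error prints a usage message and exits with status 2.
        parser.error("OPENAI_API_KEY must be set unless --no-embed is given")
    return args
```

Checking the key at parse time fails fast, before any pages are fetched, instead of after a full crawl when the embedding step starts.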

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hop, hop through every link I go,
BeautifulSoup cleans the text just so,
Chunks overlap like footprints in the snow,
OpenAI vectors make the knowledge glow,
LanceDB and FTS5 — what a show!
No more crawl4ai, just aiohttp flow,
The rabbit has indexed the docs, you know! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 63.16% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the main changes: updates to crawl_docs.py and modal_app.py for the docs-mcp S3 pipeline, which aligns with the core objective of migrating documentation crawler scripts.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/scripts/crawl_docs.py`:
- Around line 211-223: The chunk_text function has an infinite loop issue where
once end reaches len(text), start is set to end - overlap (which remains less
than len(text)), causing the loop condition to stay true indefinitely while
processing the same final chunk repeatedly. To fix this, add a break statement
after the chunks.append(text[start:end].strip()) line to check if end has
reached len(text), ensuring the loop terminates once all text has been processed
and no more progress can be made.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 619b0045-c935-4d1e-b265-78b212f8324d

📥 Commits

Reviewing files that changed from the base of the PR and between c898d7b and cfc7738.

📒 Files selected for processing (2)
  • docs/scripts/crawl_docs.py
  • docs/scripts/modal_app.py

Comment on lines +211 to +223
def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    if len(text) <= chunk_size:
        return [text]
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        boundary = text.rfind("\n\n", start, end)
        if boundary > start + overlap:
            end = boundary
        chunks.append(text[start:end].strip())
        start = end - overlap
    return [c for c in chunks if c]

Contributor

🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

Infinite loop in chunk_text for any text longer than chunk_size.

When end reaches len(text), start is set to end - overlap == len(text) - overlap, which is < len(text) whenever overlap > 0. The loop condition start < len(text) stays true and end is recomputed back to len(text) on every subsequent iteration, so start never advances and the loop never terminates. This triggers for any page exceeding CHUNK_SIZE (800 chars), which is effectively every docs page, hanging build_databases.
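The stall can be reproduced on a small input. The sketch below copies the loop from the hunk under review but records `start` on each pass and stops after a fixed number of iterations; the `max_steps` cap, the function name, and the tiny sizes are added purely for illustration.

```python
def buggy_starts(text: str, chunk_size: int, overlap: int, max_steps: int = 6) -> list[int]:
    """Return the value of `start` on each pass of the original loop, capped at max_steps."""
    starts = []
    start = 0
    while start < len(text) and len(starts) < max_steps:
        starts.append(start)
        end = min(start + chunk_size, len(text))
        boundary = text.rfind("\n\n", start, end)
        if boundary > start + overlap:
            end = boundary
        # Once end == len(text), this always yields len(text) - overlap again.
        start = end - overlap
    return starts


print(buggy_starts("x" * 25, chunk_size=10, overlap=3))
# [0, 7, 14, 21, 22, 22]
```

The repeated 22 at the end is len(text) - overlap: without the cap, `start` would hold that value on every later pass.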

🐛 Proposed fix: terminate once the end of the text is reached
         end = min(start + chunk_size, len(text))
         boundary = text.rfind("\n\n", start, end)
         if boundary > start + overlap:
             end = boundary
         chunks.append(text[start:end].strip())
+        if end >= len(text):
+            break
         start = end - overlap
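Applied to the whole function, the proposed change reads as follows. This is a sketch assembled from the hunk and the diff above; CHUNK_SIZE is taken from the review text, while the CHUNK_OVERLAP value is a placeholder because the PR page does not state it.

```python
CHUNK_SIZE = 800      # stated in the review comment
CHUNK_OVERLAP = 100   # placeholder value; the real constant is not shown here


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks, preferring blank-line boundaries."""
    if len(text) <= chunk_size:
        return [text]
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        boundary = text.rfind("\n\n", start, end)
        if boundary > start + overlap:
            end = boundary
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # final chunk emitted; stepping back by `overlap` would repeat it
        start = end - overlap
    return [c for c in chunks if c]
```

Note that progress between chunks still relies on chunk_size being larger than overlap: when no boundary is found, `start` moves forward by chunk_size - overlap on each pass.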

r33drichards pushed a commit that referenced this pull request Jun 25, 2026
- Add break when end >= len(text) to prevent infinite loop when text
  ends exactly on a chunk boundary (both crawl_docs.py and modal_app.py)
- Add docstrings to all functions in crawl_docs.py to pass the 80%
  docstring coverage threshold flagged by CodeRabbit

Addresses CodeRabbit review on PR #2003
cuaclaw added 2 commits June 25, 2026 07:31
Replaces the old crawl4ai/Modal-volume-based scripts with updated
lightweight aiohttp+BeautifulSoup crawlers that:
- Use OpenAI text-embedding-3-small for embeddings
- Write LanceDB + SQLite FTS5 databases
- Upload to s3://trycua-docs-mcp-data/docs_db/ (modal_app.py)
- Support local one-off re-indexing without Modal (crawl_docs.py)
- Include /cua-driver product page in crawl scope

These scripts were previously contributed to trycua/cloud PR #5197 but
belong in trycua/cua since they are the production crawl infrastructure.

CUA-646
- Add break when end >= len(text) to prevent infinite loop when text
  ends exactly on a chunk boundary (both crawl_docs.py and modal_app.py)
- Add docstrings to all functions in crawl_docs.py to pass the 80%
  docstring coverage threshold flagged by CodeRabbit

Addresses CodeRabbit review on PR #2003