docs(scripts): update crawl_docs.py + modal_app.py for docs-mcp S3 pipeline #2003
r33drichards wants to merge 2 commits into
📝 Walkthrough
Changes: Docs Crawler Pipeline Rewrite
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/scripts/crawl_docs.py`:
- Around line 211-223: The chunk_text function has an infinite loop issue where
once end reaches len(text), start is set to end - overlap (which remains less
than len(text)), causing the loop condition to stay true indefinitely while
processing the same final chunk repeatedly. To fix this, add a break statement
after the chunks.append(text[start:end].strip()) line to check if end has
reached len(text), ensuring the loop terminates once all text has been processed
and no more progress can be made.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 619b0045-c935-4d1e-b265-78b212f8324d
📒 Files selected for processing (2)
docs/scripts/crawl_docs.py
docs/scripts/modal_app.py
def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    if len(text) <= chunk_size:
        return [text]
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        boundary = text.rfind("\n\n", start, end)
        if boundary > start + overlap:
            end = boundary
        chunks.append(text[start:end].strip())
        start = end - overlap
    return [c for c in chunks if c]
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
Infinite loop in chunk_text for any text longer than chunk_size.
When end reaches len(text), start is set to end - overlap == len(text) - overlap, which is < len(text) for any positive overlap. The loop condition start < len(text) stays true and end is recomputed back to len(text) on every subsequent iteration, so start never advances and the loop never terminates. This triggers for any page exceeding CHUNK_SIZE (800 chars), which is effectively every docs page, hanging build_databases.
🐛 Proposed fix: terminate once the end of the text is reached
         end = min(start + chunk_size, len(text))
         boundary = text.rfind("\n\n", start, end)
         if boundary > start + overlap:
             end = boundary
         chunks.append(text[start:end].strip())
+        if end >= len(text):
+            break
         start = end - overlap
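For anyone who wants to check the proposed fix outside the pipeline, here is a minimal standalone sketch of chunk_text with the break applied. CHUNK_SIZE = 800 comes from the review comment above; the CHUNK_OVERLAP value of 100 is an assumption made only for this illustration, since the real constant is not shown in this thread.

```python
CHUNK_SIZE = 800     # stated in the review comment
CHUNK_OVERLAP = 100  # assumed value for illustration; not shown in the PR thread


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks, preferring paragraph boundaries."""
    if len(text) <= chunk_size:
        return [text]
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Prefer to cut at the last blank line, but only if that still
        # moves the window forward by more than the overlap.
        boundary = text.rfind("\n\n", start, end)
        if boundary > start + overlap:
            end = boundary
        chunks.append(text[start:end].strip())
        # Proposed fix: stop once the final chunk has been emitted.
        # Without this, start stays at len(text) - overlap forever.
        if end >= len(text):
            break
        start = end - overlap
    return [c for c in chunks if c]


# Small demonstration: a 2000-character string now terminates.
demo = "".join(str(i % 10) for i in range(2000))
print([len(c) for c in chunk_text(demo)])
```

Termination here relies on chunk_size being larger than overlap, so that every non-final iteration moves start forward; the sketch does not validate that precondition.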
- Add break when end >= len(text) to prevent infinite loop when text ends exactly on a chunk boundary (both crawl_docs.py and modal_app.py)
- Add docstrings to all functions in crawl_docs.py to pass the 80% docstring coverage threshold flagged by CodeRabbit

Addresses CodeRabbit review on PR #2003
Replaces the old crawl4ai/Modal-volume-based scripts with updated lightweight aiohttp+BeautifulSoup crawlers that:
- Use OpenAI text-embedding-3-small for embeddings
- Write LanceDB + SQLite FTS5 databases
- Upload to s3://trycua-docs-mcp-data/docs_db/ (modal_app.py)
- Support local one-off re-indexing without Modal (crawl_docs.py)
- Include /cua-driver product page in crawl scope

These scripts were previously contributed to trycua/cloud PR #5197 but belong in trycua/cua since they are the production crawl infrastructure.

CUA-646
Summary
Moves the updated docs crawler scripts from trycua/cloud (PR #5197) to trycua/cua, where they belong.
What changed
Replaces the old crawl4ai + Modal-volume based scripts with a streamlined aiohttp + BeautifulSoup crawl pipeline:
docs/scripts/modal_app.py (production)
docs/scripts/crawl_docs.py (local dev)
Why here
The docs website rendering code lives in trycua/cloud (src/website/), but the crawl/indexing infrastructure that feeds the docs-mcp server is operational tooling that belongs in trycua/cua alongside the rest of the backend.
Closes trycua/cloud#5197
CUA-646
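As a rough illustration of the SQLite FTS5 side of the pipeline described above, the following sketch builds a small full-text index with Python's standard sqlite3 module. The table name chunks, its columns, and the helper names build_fts_index and search are hypothetical and chosen for this example; the actual schema used by crawl_docs.py and modal_app.py is not shown in this PR description. It also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3


def build_fts_index(rows, db_path=":memory:"):
    """Create a hypothetical FTS5 table and insert (url, content) rows."""
    conn = sqlite3.connect(db_path)
    # url is stored but not indexed; only content is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(url UNINDEXED, content)"
    )
    conn.executemany("INSERT INTO chunks (url, content) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def search(conn, query, limit=5):
    """Return (url, content) rows matching an FTS5 query, best match first."""
    cur = conn.execute(
        "SELECT url, content FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()


# Example rows are made up for illustration.
conn = build_fts_index([
    ("/cua-driver", "Install the driver and connect to a session"),
    ("/other", "Nothing relevant on this page"),
])
print(search(conn, "driver"))
```

In a pipeline like the one described, the content column would presumably hold the output of chunk_text, with embeddings stored separately in LanceDB; that split is an inference from the description rather than something the PR states in detail.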