feat: pypi worker with downloads #4291
Merged
+2,541
−76
Changes from 4 commits
13 commits
- 5109f98 feat: add pypi packages worker (epipav)
- 8aa0372 feat: pypi downloads ingest (epipav)
- 574c950 Merge branch 'main' into feat/pypi-downloads (epipav)
- a0c9edf docs: pypi downloads adr (epipav)
- c915a75 refactor: move pypi downloads out of deps-dev (epipav)
- 5937852 fix: address pypi worker review (epipav)
- c4daff9 fix: address second review round (epipav)
- 4142844 style: fix import order in npm proxies (epipav)
- 1732ab3 fix: use per-run timeout on schedules (epipav)
- 5ed1493 style: format markdown docs (epipav)
- 18c1339 Merge branch 'main' into feat/pypi-worker-with-downloads (epipav)
- fd4dbd7 build: register pypi-worker in packages builder (epipav)
- b445ea5 Merge branch 'main' into feat/pypi-worker-with-downloads (epipav)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
|
|
||
| CREATE TABLE pypi_package_state ( | ||
| purl text PRIMARY KEY, | ||
| metadata_first_scanned_at timestamptz NOT NULL DEFAULT now(), | ||
| metadata_last_run_at timestamptz, | ||
| metadata_run_result jsonb -- { status, attempts, httpStatus?, errorKind?, message? } | ||
| ); | ||
|
|
||
| CREATE INDEX ON pypi_package_state (metadata_last_run_at); |
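The comment on `metadata_run_result` lists the keys stored in the JSONB column. A minimal TypeScript sketch of that shape, plus an idempotent write against the table, could look like the following. Only the table, column, and key names come from the migration; the field types, the `buildStateUpsert` helper, the `'failed'` status value, and the `$1`/`$2` placeholder style are assumptions for illustration and not the PR's actual write path.

```typescript
// Shape taken from the migration comment; the field types are assumed.
interface MetadataRunResult {
  status: string
  attempts: number
  httpStatus?: number
  errorKind?: string
  message?: string
}

interface SqlQuery {
  text: string
  values: unknown[]
}

// Hypothetical helper: builds a parameterized upsert for one package.
// metadata_first_scanned_at is left to its DEFAULT now() on first insert
// and is not touched on conflict, so it keeps the first-seen time.
function buildStateUpsert(purl: string, result: MetadataRunResult): SqlQuery {
  const text = [
    'INSERT INTO pypi_package_state (purl, metadata_last_run_at, metadata_run_result)',
    'VALUES ($1, now(), $2::jsonb)',
    'ON CONFLICT (purl) DO UPDATE SET',
    '  metadata_last_run_at = EXCLUDED.metadata_last_run_at,',
    '  metadata_run_result = EXCLUDED.metadata_run_result',
  ].join('\n')
  return { text, values: [purl, JSON.stringify(result)] }
}

// Example: record a failed metadata run ('failed' is an illustrative status value).
const stateQuery = buildStateUpsert('pkg:pypi/requests', {
  status: 'failed',
  attempts: 3,
  httpStatus: 503,
})
console.log(stateQuery.text)
```

Because `purl` is the primary key, re-running the same package updates the existing row instead of inserting a duplicate.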
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| # ADR-0005: PyPI downloads via BigQuery bulk export, scoped in the Postgres merge | ||
|
|
||
| **Date**: 2026-07-01 | ||
| **Status**: accepted | ||
| **Deciders**: Anil B | ||
|
|
||
| _Consolidated ADR for the PyPI downloads worker — record further PyPI-worker download decisions here rather than opening new ADRs._ | ||
|
|
||
| ## Context | ||
|
|
||
| We need PyPI download counts to match the npm shape: daily counts for the **Critical slice** | ||
| (`downloads_daily`) and rolling 30-day **Window** counts for all tracked pypi packages | ||
| (`downloads_last_30d`, mirrored to `packages.downloads_last_30d`). Unlike npm, **PyPI exposes no | ||
| per-package downloads HTTP API** — the only source is the public BigQuery dataset | ||
| `bigquery-public-data.pypi.file_downloads` (raw per-download events, timestamp-partitioned). The | ||
| worker already has proven deps.dev BigQuery→GCS→staging→merge plumbing and a job monitor keyed on | ||
| `osspckgs_ingest_jobs`. Cost is driven by bytes scanned, and a single day of the three columns we | ||
| read (`file.project`, `timestamp`, `details.installer.name`) measures ~107 GB (weekend) / ~147 GB | ||
| (monthly average), so a 30-day window is ~4.56 TB. | ||
|
|
||
| ## Decision | ||
|
|
||
| Ingest PyPI downloads as two new `bq-dataset-ingest` job kinds (`pypi_downloads_30d`, | ||
| `pypi_downloads_daily`) that run one BigQuery aggregate over a date range, export **all** projects to | ||
| GCS, load to staging, and **scope to the Critical slice in the Postgres merge** (`JOIN packages … | ||
| AND is_critical` for daily) — we never push our package list into BigQuery. The 30d workflow does a | ||
| **Latest-window refresh** for all pypi (mirroring the latest **Window**); the daily workflow does a | ||
| 2-day **Trailing re-scan** for the critical subset. Both are idempotent (`ON CONFLICT DO UPDATE`), | ||
| fixed-window, and gap-recovered by manual **Backfill** — they are deliberately **not** self-healing. | ||
|
|
||
| ## Alternatives Considered | ||
|
|
||
| ### Alternative 1: npm-style per-package HTTP fetch with watermark due-selection | ||
| - **Pros**: reuses the npm downloads model exactly; source is scoped to what's due; naturally self-healing. | ||
| - **Cons**: requires a per-package downloads API. | ||
| - **Why not**: PyPI has no such API. The BigQuery public dataset is the only source, which forces a bulk-aggregate model. | ||
|
|
||
| ### Alternative 2: Push the critical package list into BigQuery (inline `IN UNNEST([...])`) to shrink the export | ||
| - **Pros**: smaller GCS export and staging load, especially for daily backfills. | ||
| - **Cons**: inlines our data into the query text. | ||
| - **Why not**: the critical set can grow to tens of thousands+; the inline list blows BigQuery's ~1 MB query-text limit (and Temporal's ~2 MB payload limit for the name list). Merge-scoping is unbounded and matches how every deps.dev job scopes to our data in Postgres, not at the source. A cheap `getCriticalPypiCount` guard skips the scan when there are zero critical packages. | ||
|
|
||
| ### Alternative 3: Gap-filling self-healing (npm's `computeMissingLast30dWindows` model) | ||
| - **Pros**: auto-recovers missed days/months without manual intervention. | ||
| - **Cons**: needs per-package due-selection / existing-window diffing, extra state and complexity, and re-scans BigQuery anyway. | ||
| - **Why not**: for a bulk-BQ source the simpler fixed-window + idempotent-upsert + manual **Backfill** model is sufficient; deps.dev jobs re-scan on re-run too. The daily 2-day **Trailing re-scan** already corrects a partial most-recent partition. | ||
|
|
||
| ## Consequences | ||
|
|
||
| ### Positive | ||
| - Reuses the deps.dev BQ→GCS→staging→merge plumbing and the `monitor:osspckgs` cost/row dashboard for free. | ||
| - Scoping in the merge scales to any critical-set size; our package identifiers never leave Postgres. | ||
| - Idempotent upserts make re-runs and overlapping backfills safe (no duplicate rows). | ||
|
|
||
| ### Negative | ||
| - Re-running a date range re-scans BigQuery and re-bills — there is no "already imported" skip. | ||
| - The daily 2-day window re-scans each calendar day ~2×; steady-state cost ≈ $610/yr daily + $311/yr 30d ≈ **~$920/yr** at $6.25/TiB (measured). | ||
| - Not self-healing: an outage or missed schedule is recovered only by a manual **Backfill**. | ||
| - Daily export carries all ~800k projects even though the merge keeps only the critical subset (larger data movement than a source-filtered approach). | ||
|
|
||
| ### Risks | ||
| - **BigQuery cost / runaway scans** — mitigated by per-kind `BQ_DATASET_INGEST_PYPI_DOWNLOADS_*_MAX_BQ_GB` ceilings enforced via a pre-run dry-run (aborts before billing); defaults set from measured sizes (30d = 6000 GB, daily = 2000 GB). | ||
| - **Traffic growth** — the ~4.56 TB/30d figure grows with PyPI traffic; ceilings may need raising over time. |
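The steady-state cost quoted under Negative can be re-derived from the sizes in Context. The sketch below does that arithmetic and adds a ceiling check in the spirit of the dry-run guard described under Risks. It assumes the GB/TB figures are decimal units converted to TiB for billing, that the 30d workflow runs 12 times a year (this cadence is inferred from the numbers, not stated in the ADR), and that a dry run yields an estimated byte count. The function names are hypothetical.

```typescript
const BYTES_PER_TIB = 2 ** 40
const USD_PER_TIB = 6.25 // on-demand rate quoted in the ADR

// Cost in USD of scanning the given number of decimal gigabytes.
function scanCostUsd(gbScanned: number): number {
  return ((gbScanned * 1e9) / BYTES_PER_TIB) * USD_PER_TIB
}

// Daily workflow: 2-day trailing window at ~147 GB per day, 365 runs a year.
const dailyYearlyUsd = scanCostUsd(2 * 147 * 365)

// 30d workflow: ~4.56 TB per window; 12 runs a year is an assumption.
const windowYearlyUsd = scanCostUsd(4560 * 12)

// Hypothetical guard: compare a dry-run byte estimate with a per-kind GB ceiling.
function withinCeiling(estimatedBytes: number, maxGb: number): boolean {
  return estimatedBytes / 1e9 <= maxGb
}

console.log(Math.round(dailyYearlyUsd), Math.round(windowYearlyUsd))
console.log(withinCeiling(4560e9, 6000)) // 30d window against the 6000 GB default
console.log(withinCeiling(2 * 147e9, 2000)) // daily window against the 2000 GB default
```

Under these assumptions the two terms round to about $610 and $311 per year, matching the ADR's total of roughly $920. Both default ceilings leave headroom above the measured sizes.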
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| version: '3.1' | ||
|
|
||
| x-env-args: &env-args | ||
| DOCKER_BUILDKIT: 1 | ||
| NODE_ENV: docker | ||
| SERVICE: pypi-worker | ||
| CROWD_TEMPORAL_TASKQUEUE: pypi-worker | ||
| CROWD_TEMPORAL_NAMESPACE: ${CROWD_PACKAGES_TEMPORAL_NAMESPACE} | ||
| SHELL: /bin/sh | ||
| SUPPRESS_NO_CONFIG_WARNING: 'true' | ||
|
|
||
| services: | ||
| pypi-worker: | ||
| build: | ||
| context: ../../ | ||
| dockerfile: ./scripts/services/docker/Dockerfile.packages | ||
| command: 'pnpm run start:pypi-worker' | ||
| working_dir: /usr/crowd/app/services/apps/packages_worker | ||
| env_file: | ||
| - ../../backend/.env.dist.local | ||
| - ../../backend/.env.dist.composed | ||
| - ../../backend/.env.override.local | ||
| - ../../backend/.env.override.composed | ||
| environment: | ||
| <<: *env-args | ||
| restart: always | ||
| networks: | ||
| - crowd-bridge | ||
|
|
||
| pypi-worker-dev: | ||
| build: | ||
| context: ../../ | ||
| dockerfile: ./scripts/services/docker/Dockerfile.packages | ||
| command: 'pnpm run dev:pypi-worker' | ||
| working_dir: /usr/crowd/app/services/apps/packages_worker | ||
| # user: '${USER_ID}:${GROUP_ID}' | ||
| env_file: | ||
| - ../../backend/.env.dist.local | ||
| - ../../backend/.env.dist.composed | ||
| - ../../backend/.env.override.local | ||
| - ../../backend/.env.override.composed | ||
| environment: | ||
| <<: *env-args | ||
| hostname: pypi-worker | ||
| networks: | ||
| - crowd-bridge | ||
| volumes: | ||
| - ../../services/libs/audit-logs/src:/usr/crowd/app/services/libs/audit-logs/src | ||
| - ../../services/libs/common/src:/usr/crowd/app/services/libs/common/src | ||
| - ../../services/libs/common_services/src:/usr/crowd/app/services/libs/common_services/src | ||
| - ../../services/libs/data-access-layer/src:/usr/crowd/app/services/libs/data-access-layer/src | ||
| - ../../services/libs/database/src:/usr/crowd/app/services/libs/database/src | ||
| - ../../services/libs/integrations/src:/usr/crowd/app/services/libs/integrations/src | ||
| - ../../services/libs/logging/src:/usr/crowd/app/services/libs/logging/src | ||
| - ../../services/libs/nango/src:/usr/crowd/app/services/libs/nango/src | ||
| - ../../services/libs/opensearch/src:/usr/crowd/app/services/libs/opensearch/src | ||
| - ../../services/libs/queue/src:/usr/crowd/app/services/libs/queue/src | ||
| - ../../services/libs/redis/src:/usr/crowd/app/services/libs/redis/src | ||
| - ../../services/libs/snowflake/src:/usr/crowd/app/services/libs/snowflake/src | ||
| - ../../services/libs/telemetry/src:/usr/crowd/app/services/libs/telemetry/src | ||
| - ../../services/libs/temporal/src:/usr/crowd/app/services/libs/temporal/src | ||
| - ../../services/libs/types/src:/usr/crowd/app/services/libs/types/src | ||
| - ../../services/apps/packages_worker/src:/usr/crowd/app/services/apps/packages_worker/src | ||
|
|
||
| networks: | ||
| crowd-bridge: | ||
| external: true |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,14 @@ | ||
| import { scheduleOsspckgsBootstrap } from '../deps-dev/schedules/bootstrap' | ||
| import { | ||
| schedulePypiDownloads30d, | ||
| schedulePypiDownloadsDaily, | ||
| } from '../deps-dev/schedules/pypiDownloads' | ||
| import { svc } from '../service' | ||
|
|
||
| setImmediate(async () => { | ||
| await svc.init() | ||
| await scheduleOsspckgsBootstrap() | ||
| await schedulePypiDownloads30d() | ||
| await schedulePypiDownloadsDaily() | ||
| await svc.start() | ||
| }) |
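The daily schedule registered here is described in the ADR as a 2-day trailing re-scan. The sketch below shows one way to derive such a window and why overlapping runs cover each calendar day twice. The `trailingDays` helper is hypothetical, and which day a run treats as its end date is left to the caller because the diff shown here does not specify it.

```typescript
// Hypothetical helper: the `days` UTC calendar dates ending at `end`, oldest first.
function trailingDays(end: Date, days: number): string[] {
  const out: string[] = []
  for (let i = days - 1; i >= 0; i--) {
    // Date.UTC normalizes day underflow, so month boundaries are handled.
    const d = new Date(Date.UTC(end.getUTCFullYear(), end.getUTCMonth(), end.getUTCDate() - i))
    out.push(d.toISOString().slice(0, 10))
  }
  return out
}

// Two consecutive runs with a 2-day window.
const runA = trailingDays(new Date('2026-07-01T00:00:00Z'), 2)
const runB = trailingDays(new Date('2026-07-02T00:00:00Z'), 2)

// The day shared by both runs is scanned twice; the idempotent upsert
// means the later scan overwrites the earlier, possibly partial, count.
const rescanned = runA.filter((day) => runB.includes(day))
console.log(runA, runB, rescanned)
```

With a 2-day window every day appears in exactly two consecutive runs, which is where the ADR's "~2×" scan factor comes from.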
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| import { schedulePypiIngest } from '../pypi/schedule' | ||
| import { svc } from '../service' | ||
|
|
||
| setImmediate(async () => { | ||
| await svc.init() | ||
| await schedulePypiIngest() | ||
| await svc.start() | ||
| }) |
services/apps/packages_worker/src/deps-dev/activities/getCriticalPypiCount.ts (12 changes: 12 additions & 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| import { getCriticalPypiPackageCount } from '@crowd/data-access-layer' | ||
|
|
||
| import { getPackagesDb } from '../../db' | ||
|
|
||
| // Count of critical PyPI packages, so the daily downloads workflow can skip its BigQuery scan | ||
| // when there are none (the merge is scoped to is_critical, mirroring how deps.dev scopes to our | ||
| // packages in the Postgres merge rather than pushing our package list into BigQuery). | ||
| export async function getCriticalPypiCount(): Promise<{ count: number }> { | ||
| const qx = await getPackagesDb() | ||
| const count = await getCriticalPypiPackageCount(qx) | ||
| return { count } | ||
| } |
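As the comment says, this count lets the daily workflow skip its BigQuery scan when nothing is critical. A minimal sketch of that guard follows, with the count lookup and the scan injected as plain async functions so it runs standalone. `runDailyIfCritical`, `DateRange`, and the return values are hypothetical; the real workflow code is not part of the diff shown here.

```typescript
// Matches the return shape of getCriticalPypiCount in the diff above.
type CountFn = () => Promise<{ count: number }>

// Hypothetical stand-in for the BigQuery scan, export and merge steps.
interface DateRange {
  from: string
  to: string
}
type ScanFn = (range: DateRange) => Promise<void>

// Runs the scan only when at least one critical package exists.
async function runDailyIfCritical(
  getCount: CountFn,
  runScan: ScanFn,
  range: DateRange,
): Promise<'skipped' | 'scanned'> {
  const { count } = await getCount()
  if (count === 0) {
    // Nothing would survive the is_critical merge, so avoid the billed scan.
    return 'skipped'
  }
  await runScan(range)
  return 'scanned'
}
```

Checking the count first costs one cheap Postgres query and avoids paying for a scan whose rows the merge would discard.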