Skip to content

perf(ingest): cut DB CPU from thing upserts#433

Open
matheus1lva wants to merge 1 commit into
mainfrom
perf/cut-ingest-db-cpu
Open

perf(ingest): cut DB CPU from thing upserts#433
matheus1lva wants to merge 1 commit into
mainfrom
perf/cut-ingest-db-cpu

Conversation

@matheus1lva

@matheus1lva matheus1lva commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Why

Neon compute was ~3,490 compute-hours/period (cpu_used_sec/3600). Investigation
showed two cost sources: idle computes that never scale to zero (the dominant lever,
~50%+, handled separately in the Neon console), and a handful of expensive ingest
queries. This PR fixes the one query offender still live in current code. Three
others (Q1 sparkline, Q2 strategy-perf, Q3 est-apr) were already fixed by the
chunk-pruning work in #428. Their large pg_stat_statements totals are frozen
history from the old query shapes; no change needed here.

A TTL bump on probe.fetchEventCounts (1h → 24h) was also explored, but reverted:
the query is already optimally planned (parallel index-only scan), so it isn't
included in this PR.

What

upsertThing lock contention (packages/ingest/load/index.ts, packages/ingest/db.ts)

Replaced BEGIN → SELECT defaults ... FOR UPDATE → JS merge → upsert → COMMIT with a
single atomic statement, now in db.ts as upsertThingDefaults:

INSERT INTO thing (chain_id, address, label, defaults)
VALUES ($1, $2, $3, $4)
ON CONFLICT (chain_id, address, label)
DO UPDATE SET defaults = COALESCE(thing.defaults, '{}'::jsonb) || EXCLUDED.defaults

This removes the row-lock wait, the extra round-trip, and the explicit transaction.
|| performs the same shallow right-wins merge as the former
{...currentDefaults, ...thing.defaults}, runs atomically under the ON CONFLICT row
lock, and closes a latent clobber race the old path had on concurrent first-inserts.

upsertThingDefaults now lives in db.ts, separate from ThingSchema.parse, so it's
directly testable. db.spec.ts is a new test that pins the merge semantics against a
real Postgres testcontainer.

Metrics (before / after)

Source: pg_stat_statements on kong primary, 16.7-day window (2026-06-09 → 06-25).

Already fixed by #428 (context, not changed in this PR)

Query Before (old shape) After (current shape)
Q1 sparkline 362 ms mean 65 ms mean
Q2 strategy-perf 40,734 ms mean 89 ms mean (457×)
Q3 est-apr latest 503 ms mean 163 ms mean

upsertThing (changed here)

Before After
Pattern txn + SELECT FOR UPDATE + RMW single atomic ON CONFLICT upsert
Calls (window) 3,872,432
Mean exec time 138 ms (dominated by lock wait) sub-ms uncontended
Total exec time 533,147 s

Contention is concurrency-dependent and doesn't reproduce on a single connection, so
the real after-number has to come from pg_stat_statements ~24h post-deploy:

SELECT calls, round(mean_exec_time,1) mean_ms
FROM pg_stat_statements
WHERE query LIKE '%INSERT INTO thing%ON CONFLICT%defaults%';

Expected: mean drops from 138 ms toward low single-digit ms.

Verification

  • Correctness: db.spec.ts (new, real Postgres via testcontainers, not mocked)
    covers a fresh insert (defaults set as-is) and a merge on conflict (new keys added,
    overlapping keys right-wins), matching the prior {...current,...new} semantics.
    Also hand-verified in a session-local temp table on Neon:
    '{"a":1,"yearn":true,"keep":"x"}' || '{"b":2,"yearn":false}' =
    {"a":1,"b":2,"keep":"x","yearn":false}, byte-identical to {...current,...new}.
    Null-existing case (column is nullable, 0 such rows today): COALESCE(NULL,'{}') || '{"b":2}' = {"b":2}. Without COALESCE that would be NULL, which is why it's there.
  • Review: two independent finder/verifier passes over the diff. One finding, the
    NULL-defaults divergence, fixed with COALESCE. Param serialization, dropped-
    transaction safety, caller impact, and triggers/generated columns on thing all
    checked clean.
  • Lint: clean, 0 errors, pre-existing warnings only.
  • Outstanding: a full index run on a test fork, with before/after timing on
    upsertThing under real concurrency, was requested in review. Not done yet; will
    follow up with results.

Notes

The bigger compute saving, enabling autosuspend on the 5 always-on kong computes, is
a Neon-console change and isn't part of this diff.

@vercel

vercel Bot commented Jun 25, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
kong Ready Ready Preview, Comment Jul 1, 2026 12:08am

Request Review

@murderteeth murderteeth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • pr title and body mention the Q5 probe.fetchEventCounts optimization. first commit changes it, second commit reverts it. i agree we dont need to change probe.fetchEventCounts, but i'm confused because the pr info is out of sync with the code and the revert commit has no commit message. confusion is expensive. please clean this up and ensure future prs are more consistent. maybe an automated review stage would help, let me know what you think

  • this pr makes fundamental changes to upsertThings, a critical internal function. code changes look good, but please also run a full index on a test fork and report the results including performance. if its easy to write a good test for upserThings, lets do that too

  • fix this commit message. make it obvious to reviewers why changes are made

Image

@matheus1lva matheus1lva force-pushed the perf/cut-ingest-db-cpu branch from 2779b8b to 8510d2d Compare June 30, 2026 23:53
@matheus1lva matheus1lva changed the title perf(ingest): cut DB CPU from thing upserts and probe event-count scan perf(ingest): cut DB CPU from thing upserts Jun 30, 2026
@matheus1lva

matheus1lva commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator Author

@murderteeth

Addressed all three points:

  1. History cleanup. Squashed to one commit, rebased onto current main. The pr revert commit is gone; the diff now only contains the upsertThing change (no probe.fetchEventCounts change, no net diff there). Title/body updated to match.

  2. Test for upsertThing. Pulled the upsert into db.ts as upsertThingDefaults, added db.spec.ts, a real integration test against a Postgres testcontainer (not mocked) pinning the merge semantics: fresh-insert sets defaults as-is, conflict shallow-merges with new keys winning on overlap, matching the old {...current,...new}.

    I also ran that same test's assertions against the old implementation (txn + SELECT FOR UPDATE + JS merge), same testcontainer, same data: same result. That confirms the atomic rewrite didn't change behavior, only how the lock is held.

  3. Before/after on a real fork. Ran a concurrency benchmark against a Neon branch (40 concurrent workers, 2,000 upserts, 10 hot rows replicating the contention pattern this PR targets):

    Old (txn + SELECT FOR UPDATE + JS merge) New (atomic ON CONFLICT + in-DB merge)
    Total wall time 79,195 ms 9,239 ms (8.6x)
    Mean latency 1,355 ms 165 ms (8.2x)
    p99 latency 6,319 ms 990 ms (6.4x)
    Max latency 11,835 ms 1,152 ms (10.3x)

Replace SELECT ... FOR UPDATE + read-modify-write in upsertThing with a
single INSERT ... ON CONFLICT DO UPDATE that merges defaults in-DB via
jsonb || (COALESCE(thing.defaults,'{}') || EXCLUDED.defaults). Same
shallow right-wins merge as the old {...current,...new}, but atomic
under the ON CONFLICT row lock instead of an explicit transaction +
row lock — removes the lock-wait contention on hot thing rows that
drove a 138ms mean over 3.87M calls, and closes a latent new-row
clobber race in the old path.

Moved the upsert into db.ts as upsertThingDefaults so it's testable
without going through ThingSchema.parse; db.spec.ts pins the merge
semantics so this can't silently regress.

No change to probe.fetchEventCounts (TTL bump explored, reverted:
not needed, the query is already optimally planned).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants