Skip to content

Fix #1897 recovery replay request storm#1939

Open
TianchunGu wants to merge 4 commits into
MemTensor:dev-v2.0.22from
TianchunGu:codex/fix-1897-recovery-storm
Open

Fix #1897 recovery replay request storm#1939
TianchunGu wants to merge 4 commits into
MemTensor:dev-v2.0.22from
TianchunGu:codex/fix-1897-recovery-storm

Conversation

@TianchunGu

Copy link
Copy Markdown

Summary

Follow-up for #1897, stacked on top of #1938.

#1899 / #1938 stop the terminal-provider-error retry storm at the LLM facade. This PR adds a second guardrail for the capture pipeline: startup recovery of dirty closed episodes can replay a large historical episode, misclassify recovered steps as orphan traces, and then trigger many reflect/summarize LLM calls.

This patch:

  • fetches all episode traces for capture reconciliation instead of using the paginated viewer list() path;
  • matches reflect replay steps by stable content signature, with a timing-insensitive fallback for recovered tool-call traces that lack startedAt/endedAt;
  • skips recovered replay orphan inserts by default (maxRecoveryOrphanInserts: 0) to avoid duplicating historical traces;
  • avoids LLM summarization for any recovered replay orphan insert that an operator explicitly allows;
  • adds a hard per-episode reflect LLM budget (maxReflectLlmCalls, default 128);
  • stops remaining reflect LLM attempts after terminal/circuit-open provider errors.

The goal is to keep normal short-episode reflection behavior intact while preventing a single dirty recovered episode from producing unbounded paid LLM calls.

Validation

From apps/memos-local-plugin:

npx vitest run tests/unit/capture/capture.test.ts tests/unit/capture/capture-batch.test.ts tests/unit/capture/normalizer.test.ts tests/unit/llm/client.test.ts
# Test Files 4 passed (4)
# Tests 57 passed (57)

npm run lint
# tsc -p tsconfig.json --noEmit
# passed

New regression coverage:

  • recovered replay matches tool traces by payload when timestamps drift;
  • startup-recovered replay orphans are not inserted by default;
  • reflect-phase LLM calls are capped per episode.

Related: #1897, #1899, #1938

autodev-bot and others added 4 commits June 8, 2026 20:34
…for terminal provider errors

Issue MemTensor#1897 reported ~12,900 paid LLM requests in 24 h on Hermes
against a DeepSeek key with insufficient balance. The local
`system_model_status` row count (12,900) closely tracked the
provider-side `request_count` (11,344) for the same billing window.
The naming is misleading: `system_model_status` is not a health probe;
it is the audit row written once per LLM call (ok / fallback / error)
inside `core/llm/client.ts`. With no circuit breaker, every pipeline
subscriber (capture / session-relation / reward / L2 / L3 / skill /
retrieval LLM filter / world-model) kept firing on every turn / closed
episode / induction, generating one paid request each.

Add a per-`LlmClient` circuit breaker:

- Trips on terminal errors: HTTP 401/402/403 or messages containing
  `insufficient balance` / `invalid api key` / `unauthorized` /
  `account suspended` / `billing`.
- Open: short-circuits subsequent calls inside the facade without
  contacting the provider. Throws `MemosError(LLM_UNAVAILABLE)` with
  `details.circuitOpen=true` so existing catch blocks still work.
- Half-open after cool-down (default 5 min, configurable, min 30 s):
  next call probes the provider; success closes the breaker, terminal
  failure re-opens it for another cool-down.
- Host fallback rescues a call without tripping the breaker —
  fallback exists precisely to keep going when the primary is down.
- Coalesces `system_model_status="circuit_open"` audit rows to at
  most one per ~25 s while the breaker stays open, so we don't
  replace paid spam with audit-row spam.
- Exposes `circuitOpen` / `circuitOpenUntil` / `circuitOpenedReason`
  via `LlmClientStats` for the Overview viewer card.
- Enabled by default; legacy behaviour available via
  `circuitBreaker.enabled = false`.

Tests: 9 new vitest cases under `tests/unit/llm/client.test.ts`
covering trip on 402, trip on "insufficient balance" message, no
trip on generic transient, coalescing, half-open close on success,
host-fallback rescues without trip, disabled mode, stats fields,
and re-open on terminal probe failure. All 59 LLM and 28 pipeline
tests pass; `tsc --noEmit` clean.

Out of scope (tracked separately): 429 `Retry-After` handling
(issue MemTensor#1620), per-tool rate limits, daily budget caps.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants