Skip to content

fix(analytics): define top-session active duration as clamped idle gaps (alt to #867)#869

Merged
wesm merged 10 commits into
mainfrom
feat/top-sessions-active-duration-clamp
Jun 27, 2026
Merged

fix(analytics): define top-session active duration as clamped idle gaps (alt to #867)#869
wesm merged 10 commits into
mainfrom
feat/top-sessions-active-duration-clamp

Conversation

@mariusvniekerk

Copy link
Copy Markdown
Collaborator

Alternative to #867. This branch is built on @rodboev's commits (kept in
the history, so the original work and authorship are preserved); it changes
only how "active duration" is computed.

What changed

#867 computes a session's active duration by summing only the gaps that
follow a tool-use message. That captures tool-execution latency but
discards all model generation and thinking time — including the time spent
before the first tool call — so a session that reasons for minutes and then
fires one quick tool is scored as nearly idle.

This branch instead defines active duration the same way the existing
velocity "active minutes" metric already does: the sum of consecutive
inter-message gaps, with each gap capped at 5 minutes. Every gap counts —
model generation, tool execution, and quick human turnarounds — and only
stretches longer than the cap are bounded as idle.

Why

Message timestamps alone can't tell a 20-minute subagent run from a
20-minute coffee break; both are one long gap. Capping each gap fails
gracefully in both directions (a long idle gap and a long active gap each
contribute at most 5 minutes) instead of guessing, and it keeps the metric
robust to messy data: resumed sessions, machine sleep, and API stalls
produce the largest gaps, and those would otherwise dominate the sum.

It also unifies the definition. The 5-minute cap is hoisted to a single
shared constant (db.ActiveGapCapSec / ActiveGapCapMs) used by both the
velocity metric and Top Sessions, so the two "active" numbers on the
dashboard can't drift; a test asserts the seconds and milliseconds forms
agree.

Tradeoff

A genuinely long active operation — a 15-minute test run, a 20-minute
subagent — is also capped at 5 minutes, so it is undercounted. That is the
deliberate cost of not guessing whether a long gap was work or idle; the
error is bounded and symmetric.

Where to look

  • internal/db/analytics.go — shared constant, SQLite active-duration SQL,
    and the velocity metric now sourcing the same cap.
  • internal/db/timing.go — SQLite Go fallback (timezone-aware ranking path)
    running the same clamp.
  • internal/postgres/analytics.go, internal/duckdb/analytics_usage.go
    PostgreSQL and DuckDB twins.

All three SQL backends drop the has_tool_use filter and the trailing gap
to ended_at so the value matches the velocity computation.

generated by a clanker

rodboev and others added 8 commits June 25, 2026 18:27
Top Sessions "By Duration" introduced an active-duration metric that
summed only the gaps following a tool-use message. That captured
tool-execution latency but discarded all model generation and thinking
time, including the stretch before the first tool call, so it
systematically undercounted how long a session was actually worked on.

Redefine active duration as the sum of consecutive inter-message gaps
with each gap capped at five minutes, which is exactly the definition the
velocity "active minutes" metric already uses. A gap below the cap counts
in full (generation, tool execution, or a quick human turnaround); only
stretches longer than the cap are bounded as idle. Timestamps alone
cannot distinguish a long tool run from a human stepping away, so capping
fails gracefully in both directions instead of guessing.

Hoist the cap to a single shared constant (ActiveGapCapSec /
ActiveGapCapMs) so the two "active" definitions cannot drift, guarded by a
test that the seconds and milliseconds forms agree. The active-duration
SQL drops the has_tool_use filter and the trailing gap to ended_at to
match the velocity computation across SQLite, PostgreSQL, and DuckDB, and
the SQLite Go fallback runs the same clamp.

Alternative implementation to #83, built on its commits.
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (0915ffc)

Medium issue found.

Medium

  • internal/db/analytics.go:4535 - SQLite timezone fallback calls activeDurationMinForSession while the outer rows iterator remains open. This can hold one reader connection and acquire another per row; with a capped reader pool, concurrent duration fallback requests can exhaust the pool and block indefinitely. It also introduces an unbounded N+1 query pattern.
    • Fix: compute active duration in the outer query, or collect candidate row IDs first, close rows, then batch-load active durations before sorting.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 6m12s), codex_security (codex/security, done, 1m7s) | Total: 7m26s

The timezone-aware Go fallback for Top Sessions "By Duration" computed
each row's active duration with a per-row follow-up query while the outer
rows iterator was still open. That iterator pins one reader connection,
and every row then borrowed a second one, an unbounded N+1 over the
result set. The reader pool is capped at four connections, so a handful
of concurrent duration-fallback requests could each hold their outer
cursor while contending for inner connections and deadlock until the busy
timeout or context cancellation freed them.

Compute active duration in the outer query instead, exactly as the in-SQL
and PostgreSQL paths already do, so the whole result set comes back in one
statement with no second connection per row. The correlated-subquery SQL
is hoisted into a shared sqliteActiveDurationExpr helper used by both
SQLite paths, keeping the clamp definition from drifting, and is only
emitted for the duration metric so other metrics pay nothing for it.
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (665e53a)

Summary verdict: one Medium issue remains; no High or Critical findings were reported.

Medium

  • internal/postgres/analytics.go:2208, internal/db/analytics.go:4500 - Duration ranking eligibility is inconsistent across backends. PostgreSQL and the SQLite Go fallback rank by active_duration_min but only require non-null start/end times, unlike the SQLite SQL and DuckDB paths. Reversed durations such as ended_at < started_at can still rank highly if message-gap active duration is positive, and the SQLite fallback also does not reject empty timestamp strings.

    Suggested fix: Mirror SQLite/DuckDB eligibility in both paths: add ended_at >= started_at for PostgreSQL, and use NULLIF(..., '') plus julianday(ended_at) >= julianday(started_at) in the SQLite fallback. Add parity tests for reversed and empty duration rows.


Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 6m40s), codex_security (codex/security, done, 38s) | Total: 7m25s

Top Sessions "By Duration" ranks by message-gap active duration, which
stays positive even when a session's wall-clock timestamps are reversed
(ended_at < started_at) or empty. The SQLite SQL and DuckDB paths already
exclude such rows, but the PostgreSQL path and the timezone-aware SQLite
Go fallback only checked for NULL timestamps. As a result the same
malformed session could rank into the top list on those two backends
while being filtered out on the others, breaking backend parity.

Add the missing guards: ended_at >= started_at for PostgreSQL, and the
NULLIF(..., '') emptiness checks plus julianday(ended_at) >=
julianday(started_at) for the SQLite fallback, matching the canonical
SQLite SQL and DuckDB eligibility. Parity tests cover reversed rows on
all affected paths and empty-string rows on SQLite; the fallback empty
case pins created_at into the filter window so it reaches the guard
instead of being dropped by the date range.
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (9d33b21)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 6m8s), codex_security (codex/security, done, 14s) | Total: 6m22s

@wesm

wesm commented Jun 27, 2026

Copy link
Copy Markdown
Member

LGTM here

@wesm wesm merged commit 634d930 into main Jun 27, 2026
23 checks passed
@wesm wesm deleted the feat/top-sessions-active-duration-clamp branch June 27, 2026 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants