Skip to content

execution/binary: make duplicate series error message deterministic#711

Merged
fpetkovski merged 1 commit into
thanos-io:mainfrom
sandy2008:fix/binary-deterministic-duplicate-series-error
May 29, 2026
Merged

execution/binary: make duplicate series error message deterministic#711
fpetkovski merged 1 commit into
thanos-io:mainfrom
sandy2008:fix/binary-deterministic-duplicate-series-error

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

The errManyToManyMatch.Error() method previously formatted the two conflicting series in whichever order they happened to arrive on the original / duplicate slots. That order is driven by upstream StepVector iteration which is itself fed by Go map iteration (see execution/scan/subquery.go), so the rendered error string is non-deterministic across runs and across parallel engine workers.

Same root cause as the upstream Prometheus issue/PR (prometheus/prometheus#18809 / prometheus/prometheus#18810). Tracked downstream as a flaky test in Cortex: cortexproject/cortex#7546 (tactical workaround in cortexproject/cortex#7550 — removable once both upstream fixes land and are vendored).

Fix: lexicographically sort the two label-set strings before formatting the message. One-line if-swap, no new imports, preserves the existing message format so existing matchers keep working.

A regression test in the new execution/binary/utils_test.go exercises both right-hand and left-hand side cases as well as adversarial label values containing the same framing punctuation ([, ], ,, {, }, "), and asserts byte-identical output regardless of which series came first. Verified by temporarily reverting the swap: all subtests FAIL with the unsorted ordering; restoring the patch makes them PASS.

Fixes #710.

The errManyToManyMatch.Error() method previously formatted the two
conflicting series in whichever order they happened to arrive on the
"original" / "duplicate" slots. That order is driven by upstream
StepVector iteration which is itself fed by Go map iteration (see
execution/scan/subquery.go), so the rendered error string was
non-deterministic across runs and across parallel engine workers.

Downstream consumers (e.g. Cortex's distributor / querier) compare
error strings to classify failures, and the flapping ordering caused
flaky tests and inconsistent user-visible error messages -- same root
cause as the issue tracked in cortexproject/cortex#7546.

Fix: lexicographically sort the two label-set strings before
formatting the message. This is a one-line if-swap, no new imports
required, and preserves the existing message format so existing
matchers continue to work.

A regression test exercises both right-hand and left-hand side cases
as well as adversarial label values containing the same framing
punctuation ('[', ']', ',', '{', '}', '"') and asserts byte-identical
output regardless of which series came first.

Refs: cortexproject/cortex#7546
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

@fpetkovski fpetkovski left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@fpetkovski fpetkovski merged commit f831b43 into thanos-io:main May 29, 2026
6 of 7 checks passed
sandy2008 added a commit to sandy2008/cortex-1 that referenced this pull request May 30, 2026
…-series quirk

Addresses review feedback on cortexproject#7550. Replaces the generic bracket-list
canonicalizer (errBracketListRE / canonicalizeErrMsg / splitTopLevelCommas)
with a single-purpose, anchored duplicateSeriesRE + canonicalizeDuplicateSeriesErr
that only normalizes the non-deterministic two-element list in the PromQL
"found duplicate series for the match group" error. The regex matches the match
group and both entries as whole quote-aware labelsets, so a "," inside a labelset
is never mistaken for the entry separator; the two-element shape is hard-coded so
the comparator falls back to a strict (failing) compare if upstream reformats.

Also restores the *promv1.Error.Detail comparison that the original cmp.Equal
performed, and expands TestSameErrorClass to 19 subtests (multi-label reorder,
label values containing brackets/commas/braces/escaped quotes, nested brackets,
same list with different surrounding text, three-element and changed-separator
loud-break guards, and an unrelated-bracket-list anchoring guard).

Remove this workaround once prometheus/prometheus#18810 and
thanos-io/promql-engine#711 are vendored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
friedrichg pushed a commit to cortexproject/cortex that referenced this pull request Jun 19, 2026
#7550)

* fix(integration): compare query fuzz errors by class, not exact string

Three call sites in integration/query_fuzz_test.go compared two
Prometheus HTTP API errors with cmp.Equal(tc.err1, tc.err2), which is
a strict byte-wise string compare on the underlying *promv1.Error.Msg.
When a fuzz query exercised a vector-matching path, the upstream
engine produced messages of shape

  execution: found duplicate series for the match group ...
    [{__name__="test_series_a", series="2"}, {series="2"}]; ...

The contents of the [...] group are built from Go-map iteration in
vendor/github.com/prometheus/prometheus/promql/engine.go, so the two
entries can appear in either order between Cortex and the reference
Prometheus, even though the error class and the underlying series set
are identical. The strict compare then reported a divergence that did
not exist. Issue #7546 captures the full census.

Fix: add sameErrorClass(err1, err2) backed by two helpers in the same
file. canonicalizeErrMsg sorts the contents of every [...] group on
top-level commas; splitTopLevelCommas is brace-depth aware so commas
inside a Prometheus labelset literal ({a="1", b="2"}) are NOT split.
When both errors unwrap to *promv1.Error via errors.As, the
comparator requires equal Type AND equal canonicalized Msg; otherwise
it falls back to comparing canonicalized Error() strings. Replace
the three strict-string sites in TestDisableChunkTrimmingFuzz
(~:408), TestExpandedPostingsCacheFuzz (~:644), and the shared
runQueryFuzzTestCases helper (~:1959) used by ~ten other fuzz tests
in this file.

Add TestSameErrorClass with nine subtests, including the explicit
"same typed class, materially different messages (must not mask)"
guard — two errors with the same *promv1.Error.Type but truly
different messages must still diverge, so the comparator cannot
silently swallow a real Cortex-vs-Prometheus regression.

Fixes #7546

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

* ci: re-run after upstream/master rebase

Failing TestVerticalShardingFuzz on the prior CI run was the same
pre-existing flake tracked in #7547 / fixed by #7551 — not a
regression from this PR's error-comparator change.

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

* fix(integration): narrow query-fuzz error comparator to the duplicate-series quirk

Addresses review feedback on #7550. Replaces the generic bracket-list
canonicalizer (errBracketListRE / canonicalizeErrMsg / splitTopLevelCommas)
with a single-purpose, anchored duplicateSeriesRE + canonicalizeDuplicateSeriesErr
that only normalizes the non-deterministic two-element list in the PromQL
"found duplicate series for the match group" error. The regex matches the match
group and both entries as whole quote-aware labelsets, so a "," inside a labelset
is never mistaken for the entry separator; the two-element shape is hard-coded so
the comparator falls back to a strict (failing) compare if upstream reformats.

Also restores the *promv1.Error.Detail comparison that the original cmp.Equal
performed, and expands TestSameErrorClass to 19 subtests (multi-label reorder,
label values containing brackets/commas/braces/escaped quotes, nested brackets,
same list with different surrounding text, three-element and changed-separator
loud-break guards, and an unrelated-bracket-list anchoring guard).

Remove this workaround once prometheus/prometheus#18810 and
thanos-io/promql-engine#711 are vendored.

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

---------

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

execution/binary: "found duplicate series for the match group" error is non-deterministic

2 participants