fix: E2E trace-log correlation, Tempo datasource, cross-signal Grafana links #59

Merged
szibis merged 26 commits into main from fix/e2e-correlation-datasources on May 14, 2026

@szibis szibis commented May 14, 2026

Contributor

Summary

  • Rewrite datagen for trace-log correlation: traces generated first, 70% of logs share matching trace IDs, span IDs, and service context
  • Add derivedFields to all VictoriaLogs and Loki datasources for clickable log→trace links in Grafana Explore
  • Add Tempo datasource backed by VictoriaTraces Tempo v2 API for Loki→Tempo trace correlation
  • Fix ClickHouse otel_logs view: promoted fields in LogAttributes only (removed ResourceAttributes duplication)
  • Bump loki-vl-proxy to v1.33.0
  • Sync CHANGELOG with main branch

Changes

  • cmd/datagen/main.go: Phase 1 generates traces collecting traceCtx, Phase 2 generates logs with 70% correlated to existing traces. Log bodies include trace_id=xxx span_id=xxx for derivedFields regex extraction.
  • datasources.yaml: Added derivedFields with trace_id=(\w+) regex to VL Global, VL Hot, Lakehouse Cold, and Loki datasources. Added Tempo datasource at http://vtselect:10428/select/tempo with tracesToLogsV2 pointing to Loki proxy.
  • init-s3.sql: ClickHouse otel_logs view puts promoted fields into LogAttributes map only, removing ResourceAttributes duplication.
  • Dockerfile.loki-vl-proxy: Bump to v1.33.0
  • docker-compose-e2e.yml: loki-vl-proxy label-style config update
  • CHANGELOG.md: Synced Unreleased section with main, added entries for all fixes
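
The two-phase datagen flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in cmd/datagen/main.go: the `traceCtx` field names, the `logRecord` type, the `newRNG`/`newID` helpers, the `standalone` service name, and the log body wording are all assumptions. Only the ordering (traces first, then logs), the 70% ratio, and the `trace_id=… span_id=…` suffix come from the description.

```go
package main

import (
	"fmt"
	"math/rand"
)

// traceCtx is a hypothetical shape for the context collected in phase 1.
type traceCtx struct {
	TraceID string
	SpanID  string
	Service string
}

// logRecord is a hypothetical log entry; Correlated marks logs that reuse
// an existing trace context.
type logRecord struct {
	Body       string
	Trace      traceCtx
	Correlated bool
}

// newRNG returns a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// newID returns n random bytes rendered as 2n lowercase hex characters.
func newID(rng *rand.Rand, n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = byte(rng.Intn(256))
	}
	return fmt.Sprintf("%x", b)
}

// generateTraces is phase 1: build the trace contexts before any logs exist.
func generateTraces(rng *rand.Rand, n int, services []string) []traceCtx {
	out := make([]traceCtx, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, traceCtx{
			TraceID: newID(rng, 16),
			SpanID:  newID(rng, 8),
			Service: services[i%len(services)],
		})
	}
	return out
}

// generateLogs is phase 2: about `ratio` of the logs borrow an existing trace
// context; the rest get fresh IDs. Every body carries trace_id and span_id.
func generateLogs(rng *rand.Rand, n int, traces []traceCtx, ratio float64) []logRecord {
	logs := make([]logRecord, 0, n)
	for i := 0; i < n; i++ {
		rec := logRecord{Trace: traceCtx{
			TraceID: newID(rng, 16),
			SpanID:  newID(rng, 8),
			Service: "standalone", // assumed placeholder for uncorrelated logs
		}}
		if len(traces) > 0 && rng.Float64() < ratio {
			rec.Trace = traces[rng.Intn(len(traces))]
			rec.Correlated = true
		}
		rec.Body = fmt.Sprintf("request handled service=%s trace_id=%s span_id=%s",
			rec.Trace.Service, rec.Trace.TraceID, rec.Trace.SpanID)
		logs = append(logs, rec)
	}
	return logs
}

func main() {
	rng := newRNG(1)
	traces := generateTraces(rng, 100, []string{"api", "db", "cache"})
	logs := generateLogs(rng, 1000, traces, 0.7)
	correlated := 0
	for _, l := range logs {
		if l.Correlated {
			correlated++
		}
	}
	fmt.Printf("%d of %d logs share a trace context\n", correlated, len(logs))
}
```

Generating traces first means every correlated log points at a trace that actually exists, so a log→trace link in Grafana should resolve rather than land on an empty trace view.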

Correlation Matrix

| From → To | VL Global | VL Hot | Lakehouse Cold | Loki | Tempo | ClickHouse |
|---|---|---|---|---|---|---|
| Log → Trace | derivedFields→Jaeger | derivedFields→Jaeger | derivedFields→Jaeger | derivedFields→Tempo | | |
| Trace → Log | | | | | tracesToLogsV2→Loki | tracesToLogsV2→CH Logs |
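
The log→trace links in the matrix depend on the `trace_id=(\w+)` regex pulling an ID out of each log body. The sketch below applies the same pattern with Go's `regexp` package to show what the first capture group yields; the `extractTraceID` helper and the sample log line are illustrative only, and Grafana's own matching happens in the frontend, not in Go.

```go
package main

import (
	"fmt"
	"regexp"
)

// traceIDPattern is the derivedFields regex quoted in the PR description.
var traceIDPattern = regexp.MustCompile(`trace_id=(\w+)`)

// extractTraceID returns the first capture group, which is the value a
// derived field would pass to the trace datasource.
func extractTraceID(body string) (string, bool) {
	m := traceIDPattern.FindStringSubmatch(body)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Hypothetical log body in the shape datagen is described as producing.
	body := "GET /checkout 200 trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7"
	if id, ok := extractTraceID(body); ok {
		fmt.Println(id)
	}
}
```

Because `\w+` stops at the first non-word character, the space before `span_id=` ends the match, so the span ID is not swallowed into the trace ID.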

Known Issue

VictoriaTraces Tempo search API returns incorrect startTimeUnixNano and overflows durationMs. Fix submitted upstream: VictoriaMetrics/VictoriaTraces#153. Tags and trace-by-ID endpoints work correctly. Trace correlation via Loki→Tempo derivedFields works as a workaround.

Test plan

  • CI passes (test, lint, build, docker)
  • E2E tests pass with new datagen correlation
  • Verify in Grafana: VL/Loki logs show clickable trace link buttons
  • Verify in Grafana: Clicking trace link opens Jaeger/Tempo trace view
  • Verify in Grafana: ClickHouse logs show all promoted fields in LogAttributes

szibis added 22 commits May 14, 2026 21:11
Add UpdatePeersWithZones, LookupAZ, StatsAZ methods to PeerCache.
Handler now exposes /internal/cache/stats with AZ info for peer
discovery. Update NewHandler signature to accept selfAZ parameter.
SetEndpointsWithZones classifies insert endpoints by AZ. QueryLogs
and QueryTraces use getQueryEndpoints to prefer same-AZ or restrict
to same-AZ only based on config (preferred vs strict mode).
Detect AZ via azdetect at startup, expose via /internal/cache/stats.
RefreshDiscovery queries peer AZs and updates ring with zone info.
Strict mode warns when same-AZ peers below minimum threshold.
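
The preferred-versus-strict endpoint selection described in these commits could look roughly like the sketch below. It is not the repository's `getQueryEndpoints`: the `endpoint` type, the `selectEndpoints` name, and the mode strings `"preferred"` and `"strict"` are assumptions made for illustration.

```go
package main

import "fmt"

// endpoint is a hypothetical query endpoint tagged with its availability zone.
type endpoint struct {
	URL string
	AZ  string
}

// selectEndpoints narrows the candidate list by AZ.
//   - "strict":    only same-AZ endpoints, even if that leaves none
//   - "preferred": same-AZ endpoints when any exist, otherwise all endpoints
//   - other modes, or an unknown self AZ: all endpoints
func selectEndpoints(all []endpoint, selfAZ, mode string) []endpoint {
	if selfAZ == "" || (mode != "preferred" && mode != "strict") {
		return all
	}
	var same []endpoint
	for _, e := range all {
		if e.AZ == selfAZ {
			same = append(same, e)
		}
	}
	if mode == "strict" || len(same) > 0 {
		return same
	}
	return all // preferred mode with no same-AZ peer falls back to everything
}

func main() {
	eps := []endpoint{
		{URL: "http://select-0:8080", AZ: "az-a"},
		{URL: "http://select-1:8080", AZ: "az-b"},
	}
	fmt.Println(len(selectEndpoints(eps, "az-a", "preferred")))
	fmt.Println(len(selectEndpoints(eps, "az-c", "preferred")))
	fmt.Println(len(selectEndpoints(eps, "az-c", "strict")))
}
```

In this sketch strict mode can return an empty list, which is the situation the commit's "same-AZ peers below minimum threshold" warning is presumably meant to surface.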
…on, AZ config

Add default topology spread constraints for even AZ distribution.
Inject NODE_NAME env var for K8s API AZ detection fallback.
Add peer and select AZ config defaults to values.yaml.
Mirror logs AZ wiring: azdetect at startup, SetSelfAZ on storage,
AZ in /internal/cache/stats, zone-aware RefreshDiscovery with
queryPeerAZs helper.
Full flow test: handlers in different AZs → peer cache routing → stats.
Stats endpoint test: verify AZ exposed via JSON.
Peer AZ discovery test: simulate RefreshDiscovery peer query flow.
AZ detect chain tests: env var priority, all-fail empty result.
Add LAKEHOUSE_AZ=az-a env to lakehouse services in compose.
E2E tests verify: AZ in /internal/cache/stats, health/ready pass
with AZ detection, queries work with AZ-aware routing enabled.
…rmaid diagrams

Add implementation status banner, AZ detection flowchart, peer
discovery sequence diagram, write path architecture diagram.
Update config examples to match actual YAML format. Update metrics
and alert rules to match actual metric names. Add CHANGELOG entries.
…ge gaps

- fetchPeerAZ used Authorization Bearer but handler expects X-Peer-Auth-Key
- Handler.ServeHTTP read selfAZ without lock (data race with SetSelfAZ)
- Use json.Marshal instead of %q for valid JSON with non-UTF8 AZ names
- mergeConfig missed boolean fields: Peer.AZAware, CrossAZFallback, Select.*
- Add integration tests for IMDS/GCP metadata edge cases
- Add config merge tests for AZ field overlay behavior
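
The `json.Marshal` fix above is easy to reproduce in isolation. `%q` emits Go string syntax, so an invalid UTF-8 byte becomes a `\x` escape that JSON does not allow, while `json.Marshal` substitutes the Unicode replacement character. The helper names and the `{"az": …}` payload shape below are illustrative and not taken from the handler.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statsWithSprintf builds the payload with %q, which uses Go escaping rules.
func statsWithSprintf(az string) string {
	return fmt.Sprintf(`{"az":%q}`, az)
}

// statsWithMarshal builds the payload with encoding/json, which coerces the
// string to valid UTF-8 and uses JSON escaping rules.
func statsWithMarshal(az string) (string, error) {
	b, err := json.Marshal(map[string]string{"az": az})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// isValidJSON reports whether s parses as JSON.
func isValidJSON(s string) bool { return json.Valid([]byte(s)) }

func main() {
	az := "az-\xff" // contains a byte that is not valid UTF-8

	fmt.Println(isValidJSON(statsWithSprintf(az))) // false

	out, err := statsWithMarshal(az)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(isValidJSON(out)) // true
}
```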
8 fuzz targets across azdetect, peercache, config, and storage:
- FuzzDetect_EnvVar, FuzzDetectAWSIMDS, FuzzDetectGCPMetadata
- FuzzValidateAZMode, FuzzRingLookupAZ, FuzzRingLookupAZ_NoSameAZ
- FuzzHandlerStatsEndpoint, FuzzStorageFetchPeerAZ

Edge case tests covering:
- K8s node label parsing (GA/legacy labels, missing token, unreachable)
- Ring with empty zones, single peer, overwrite, large ring (100 peers)
- Handler concurrent access, special chars in AZ (unicode, quotes, newlines)
- Buffer bridge all-same-AZ, all-cross-AZ, mixed modes, endpoint upgrade
- Storage peer AZ with HTTP 500, extra JSON fields, auth combinations
- Config defaults verification, AZ mode validation (3 valid, 8 invalid)
Adds [Unreleased] changelog entries for auto-release to materialize:
- AZ auto-detection, peer cache routing, buffer bridge, metrics
- Bug fixes: auth header mismatch, data race, invalid JSON, config merge
- 8 fuzz targets, 60+ edge case tests
Handle all unchecked error returns (resp.Body.Close, json.Decode,
os.Setenv, fmt.Fprintf, w.Write). Use t.Setenv in tests for automatic
cleanup. Replace numeric HTTP status codes with constants. Add #nosec
G704 annotation for K8s API URL with hardcoded host.
- Rewrite datagen to generate traces first, then correlate 70% of logs
  with matching trace IDs, span IDs, and service context
- Append trace_id=xxx span_id=xxx to all log bodies for derivedFields
- Add derivedFields to all VictoriaLogs datasources (Global, Hot, Cold)
  linking logs to their respective Jaeger trace datasources
- Update Loki derivedFields to use trace_id=(\w+) regex matching
- Fix ClickHouse otel_logs view: put promoted fields in LogAttributes
  only (no duplication in ResourceAttributes)
- Bump loki-vl-proxy to v1.33.0
@github-actions github-actions Bot added size/XL Extra large change scope/config Config changes scope/docs Documentation scope/helm Helm chart scope/metrics Metrics/observability scope/peercache Peer cache scope/tests Test suite labels May 14, 2026
szibis added 2 commits May 15, 2026 00:04
- Add Tempo datasource at /select/tempo (VT v0.8.0+ Tempo API v2)
- Point Loki derivedFields to Tempo for log→trace correlation
- Sync CHANGELOG.md Unreleased section with main (v0.23.0 already released)
@szibis szibis changed the title fix: add trace-log correlation and cross-datasource trace links fix: E2E trace-log correlation, Tempo datasource, and cross-signal Grafana links May 14, 2026
@szibis szibis changed the title fix: E2E trace-log correlation, Tempo datasource, and cross-signal Grafana links fix: E2E trace-log correlation, Tempo datasource, cross-signal Grafana links May 14, 2026
@github-actions github-actions Bot added size/L Large change and removed size/XL Extra large change labels May 14, 2026
@szibis szibis merged commit 3bd1f76 into main May 14, 2026
33 checks passed
Labels

scope/config Config changes scope/docs Documentation scope/helm Helm chart scope/metrics Metrics/observability scope/peercache Peer cache scope/tests Test suite size/L Large change
