fix: E2E trace-log correlation, Tempo datasource, cross-signal Grafana links #59
Merged
…red modes, E2E tests
Add UpdatePeersWithZones, LookupAZ, StatsAZ methods to PeerCache. Handler now exposes /internal/cache/stats with AZ info for peer discovery. Update NewHandler signature to accept selfAZ parameter.
SetEndpointsWithZones classifies insert endpoints by AZ. QueryLogs and QueryTraces use getQueryEndpoints to prefer same-AZ or restrict to same-AZ only based on config (preferred vs strict mode).
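A minimal sketch of how the preferred and strict modes named in this commit could differ when choosing query endpoints. The `Endpoint` type and `selectEndpoints` function are hypothetical stand-ins, not the PR's `getQueryEndpoints`; the mode strings and the fallback when the local AZ is unknown are assumptions.

```go
package main

import "fmt"

// Endpoint is a hypothetical endpoint tagged with its availability zone.
type Endpoint struct {
	URL string
	AZ  string
}

// selectEndpoints is an illustrative stand-in for getQueryEndpoints.
//   - "strict":    only same-AZ endpoints are returned (possibly none).
//   - "preferred": same-AZ endpoints if any exist, otherwise all endpoints.
//   - otherwise:   all endpoints, zone ignored (assumed behavior).
//
// An unknown local AZ (empty selfAZ) also disables filtering (assumption).
func selectEndpoints(all []Endpoint, selfAZ, mode string) []Endpoint {
	if selfAZ == "" || (mode != "strict" && mode != "preferred") {
		return all
	}
	var same []Endpoint
	for _, e := range all {
		if e.AZ == selfAZ {
			same = append(same, e)
		}
	}
	if mode == "strict" || len(same) > 0 {
		return same
	}
	return all // preferred mode with no same-AZ endpoint: fall back to everything
}

func main() {
	eps := []Endpoint{
		{URL: "http://insert-a:9428", AZ: "az-a"},
		{URL: "http://insert-b:9428", AZ: "az-b"},
	}
	fmt.Println(selectEndpoints(eps, "az-a", "preferred"))
	fmt.Println(selectEndpoints(eps, "az-c", "preferred"))
	fmt.Println(selectEndpoints(eps, "az-c", "strict"))
}
```

The difference only shows up when no same-AZ endpoint exists: preferred mode falls back to cross-AZ endpoints, while strict mode returns nothing and leaves the caller to handle the empty set.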
Detect AZ via azdetect at startup, expose via /internal/cache/stats. RefreshDiscovery queries peer AZs and updates ring with zone info. Strict mode warns when same-AZ peers below minimum threshold.
…on, AZ config Add default topology spread constraints for even AZ distribution. Inject NODE_NAME env var for K8s API AZ detection fallback. Add peer and select AZ config defaults to values.yaml.
Mirror logs AZ wiring: azdetect at startup, SetSelfAZ on storage, AZ in /internal/cache/stats, zone-aware RefreshDiscovery with queryPeerAZs helper.
Full flow test: handlers in different AZs → peer cache routing → stats. Stats endpoint test: verify AZ exposed via JSON. Peer AZ discovery test: simulate RefreshDiscovery peer query flow. AZ detect chain tests: env var priority, all-fail empty result.
Add LAKEHOUSE_AZ=az-a env to lakehouse services in compose. E2E tests verify: AZ in /internal/cache/stats, health/ready pass with AZ detection, queries work with AZ-aware routing enabled.
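The detection chain exercised by the tests above (env var priority, empty result when every source fails) can be sketched roughly as follows. This is an assumption-laden illustration, not the `azdetect` package: the `detector` type, `detectAZ`, `fromEnv`, and `failing` are hypothetical names, and only the `LAKEHOUSE_AZ` variable comes from the commits.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errUnavailable marks a detection source that could not produce an AZ.
var errUnavailable = errors.New("az source unavailable")

// detector is a hypothetical detection step (env var, IMDS, GCP metadata, K8s API).
type detector func() (string, error)

// detectAZ tries each detector in order and returns the first non-empty AZ.
// If every detector fails, it returns the empty string.
func detectAZ(chain ...detector) string {
	for _, d := range chain {
		if az, err := d(); err == nil && az != "" {
			return az
		}
	}
	return ""
}

// fromEnv builds a detector that reads the AZ from an environment variable.
func fromEnv(name string) detector {
	return func() (string, error) {
		if v := os.Getenv(name); v != "" {
			return v, nil
		}
		return "", errUnavailable
	}
}

// static always returns the given AZ; it stands in for a metadata lookup.
func static(az string) detector {
	return func() (string, error) { return az, nil }
}

// failing always fails; it stands in for an unreachable metadata service.
func failing() (string, error) { return "", errUnavailable }

func main() {
	// The env var is listed first, so it wins whenever it is set.
	az := detectAZ(fromEnv("LAKEHOUSE_AZ"), failing, static("az-fallback"))
	fmt.Printf("detected AZ: %q\n", az)
}
```

Putting the env var first keeps local and compose setups deterministic, since an explicit override never depends on a reachable metadata service.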
…rmaid diagrams Add implementation status banner, AZ detection flowchart, peer discovery sequence diagram, write path architecture diagram. Update config examples to match actual YAML format. Update metrics and alert rules to match actual metric names. Add CHANGELOG entries.
…ge gaps
- fetchPeerAZ used Authorization Bearer but handler expects X-Peer-Auth-Key
- Handler.ServeHTTP read selfAZ without lock (data race with SetSelfAZ)
- Use json.Marshal instead of %q for valid JSON with non-UTF8 AZ names
- mergeConfig missed boolean fields: Peer.AZAware, CrossAZFallback, Select.*
- Add integration tests for IMDS/GCP metadata edge cases
- Add config merge tests for AZ field overlay behavior
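To illustrate the `%q` versus `json.Marshal` fix listed above: `%q` emits Go string syntax, which uses `\x` escapes for invalid UTF-8 bytes, and JSON has no such escape. The sketch below is illustrative only; the `statsResponse` type and its `az` field name are assumptions, not the handler's actual payload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statsResponse is a hypothetical shape for the stats payload.
type statsResponse struct {
	AZ string `json:"az"`
}

// statsWithQuote builds the body with %q, which produces a Go-quoted string.
// Invalid UTF-8 bytes become \x escapes, which JSON parsers reject.
func statsWithQuote(az string) []byte {
	return []byte(fmt.Sprintf(`{"az":%q}`, az))
}

// statsWithMarshal lets encoding/json do the escaping. Invalid UTF-8 is
// replaced with U+FFFD, so the output is always valid JSON.
func statsWithMarshal(az string) ([]byte, error) {
	return json.Marshal(statsResponse{AZ: az})
}

func main() {
	az := "az-\xff" // deliberately not valid UTF-8

	quoted := statsWithQuote(az)
	fmt.Printf("%s valid=%v\n", quoted, json.Valid(quoted))

	marshaled, err := statsWithMarshal(az)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Printf("%s valid=%v\n", marshaled, json.Valid(marshaled))
}
```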
8 fuzz targets across azdetect, peercache, config, and storage:
- FuzzDetect_EnvVar, FuzzDetectAWSIMDS, FuzzDetectGCPMetadata
- FuzzValidateAZMode, FuzzRingLookupAZ, FuzzRingLookupAZ_NoSameAZ
- FuzzHandlerStatsEndpoint, FuzzStorageFetchPeerAZ

Edge case tests covering:
- K8s node label parsing (GA/legacy labels, missing token, unreachable)
- Ring with empty zones, single peer, overwrite, large ring (100 peers)
- Handler concurrent access, special chars in AZ (unicode, quotes, newlines)
- Buffer bridge all-same-AZ, all-cross-AZ, mixed modes, endpoint upgrade
- Storage peer AZ with HTTP 500, extra JSON fields, auth combinations
- Config defaults verification, AZ mode validation (3 valid, 8 invalid)
Adds [Unreleased] changelog entries for auto-release to materialize:
- AZ auto-detection, peer cache routing, buffer bridge, metrics
- Bug fixes: auth header mismatch, data race, invalid JSON, config merge
- 8 fuzz targets, 60+ edge case tests
Handle all unchecked error returns (resp.Body.Close, json.Decode, os.Setenv, fmt.Fprintf, w.Write). Use t.Setenv in tests for automatic cleanup. Replace numeric HTTP status codes with constants. Add #nosec G704 annotation for K8s API URL with hardcoded host.
- Rewrite datagen to generate traces first, then correlate 70% of logs with matching trace IDs, span IDs, and service context
- Append trace_id=xxx span_id=xxx to all log bodies for derivedFields
- Add derivedFields to all VictoriaLogs datasources (Global, Hot, Cold) linking logs to their respective Jaeger trace datasources
- Update Loki derivedFields to use trace_id=(\w+) regex matching
- Fix ClickHouse otel_logs view: put promoted fields in LogAttributes only (no duplication in ResourceAttributes)
- Bump loki-vl-proxy to v1.33.0
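The log-body suffix and the `trace_id=(\w+)` regex from this commit fit together roughly as sketched below. The `traceCtx` name appears in the PR description, but its fields, the helper functions, and the choice to give uncorrelated logs fresh random IDs are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"regexp"
)

// traceCtx holds the IDs a log line can point back to (fields are assumed).
type traceCtx struct {
	TraceID string
	SpanID  string
}

// traceIDRe is the derivedFields regex quoted in the commit message.
var traceIDRe = regexp.MustCompile(`trace_id=(\w+)`)

// newTraceCtx generates random hex IDs (32 hex chars for the trace, 16 for the span).
func newTraceCtx(r *rand.Rand) traceCtx {
	return traceCtx{
		TraceID: fmt.Sprintf("%016x%016x", r.Uint64(), r.Uint64()),
		SpanID:  fmt.Sprintf("%016x", r.Uint64()),
	}
}

// appendTraceContext adds the trace_id/span_id suffix to a log body.
func appendTraceContext(msg string, tc traceCtx) string {
	return fmt.Sprintf("%s trace_id=%s span_id=%s", msg, tc.TraceID, tc.SpanID)
}

// pickContext reuses an existing trace with probability ratio; otherwise it
// returns fresh IDs that match no generated trace (assumed behavior).
func pickContext(r *rand.Rand, traces []traceCtx, ratio float64) (traceCtx, bool) {
	if len(traces) > 0 && r.Float64() < ratio {
		return traces[r.Intn(len(traces))], true
	}
	return newTraceCtx(r), false
}

// extractTraceID applies the derivedFields regex the way Grafana would.
func extractTraceID(body string) (string, bool) {
	m := traceIDRe.FindStringSubmatch(body)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	r := rand.New(rand.NewSource(1))

	// Phase 1: generate traces and remember their contexts.
	traces := make([]traceCtx, 5)
	for i := range traces {
		traces[i] = newTraceCtx(r)
	}

	// Phase 2: generate logs, roughly 70% of them tied to an existing trace.
	correlated := 0
	for i := 0; i < 1000; i++ {
		tc, ok := pickContext(r, traces, 0.7)
		body := appendTraceContext("request handled", tc)
		if id, found := extractTraceID(body); found && ok && id == tc.TraceID {
			correlated++
		}
	}
	fmt.Printf("correlated logs: %d of 1000\n", correlated)
}
```

Because `\w` matches only letters, digits, and underscore, the capture stops at the space before `span_id=`, so the extracted value is the bare trace ID.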
- Add Tempo datasource at /select/tempo (VT v0.8.0+ Tempo API v2)
- Point Loki derivedFields to Tempo for log→trace correlation
- Sync CHANGELOG.md Unreleased section with main (v0.23.0 already released)
…atasources # Conflicts: # CHANGELOG.md
Summary
- Added `derivedFields` to all VictoriaLogs and Loki datasources for clickable log→trace links in Grafana Explore
- `otel_logs` view: promoted fields in LogAttributes only (removed ResourceAttributes duplication)

Changes
- `cmd/datagen/main.go`: Phase 1 generates traces, collecting `traceCtx`; Phase 2 generates logs, with 70% correlated to existing traces. Log bodies include `trace_id=xxx span_id=xxx` for derivedFields regex extraction.
- `datasources.yaml`: Added `derivedFields` with `trace_id=(\w+)` regex to VL Global, VL Hot, Lakehouse Cold, and Loki datasources. Added Tempo datasource at `http://vtselect:10428/select/tempo` with `tracesToLogsV2` pointing to Loki proxy.
- `init-s3.sql`: ClickHouse otel_logs view puts promoted fields into LogAttributes map only, removing ResourceAttributes duplication.
- `Dockerfile.loki-vl-proxy`: Bump to v1.33.0
- `docker-compose-e2e.yml`: loki-vl-proxy label-style config update
- `CHANGELOG.md`: Synced Unreleased section with main, added entries for all fixes

Correlation Matrix
Known Issue
VictoriaTraces Tempo search API returns incorrect `startTimeUnixNano` and overflows `durationMs`. Fix submitted upstream: VictoriaMetrics/VictoriaTraces#153. Tags and trace-by-ID endpoints work correctly. Trace correlation via Loki→Tempo derivedFields works as a workaround.

Test plan