diff --git a/docs/development/debugging.md b/docs/development/debugging.md index 68a3adaad8..6cc0a04f34 100644 --- a/docs/development/debugging.md +++ b/docs/development/debugging.md @@ -1,5 +1,11 @@ # How To Debug +## Controlling Log Level + +For a full reference on log-level controls — including which pods are affected, YAML snippets for every component, and advanced `RUST_LOG` filtering for data-plane pods — see [Log Levels](../user-guide/reference/configuration/log-levels.md). + +## Debug Logs + To enable debug logs in a Vertex Pod, set environment variable `NUMAFLOW_DEBUG` to `true` for the Vertex. For example: ```yaml diff --git a/docs/user-guide/reference/configuration/environment-variables.md b/docs/user-guide/reference/configuration/environment-variables.md index 5da09b97c3..8cd4047e68 100644 --- a/docs/user-guide/reference/configuration/environment-variables.md +++ b/docs/user-guide/reference/configuration/environment-variables.md @@ -1,6 +1,14 @@ # Environment Variables -For the `numa` container of vertex pods, environment variable `NUMAFLOW_DEBUG` can be set to `true` for [debugging](../../../development/debugging.md). +## Log level control + +Numaflow exposes three env vars for controlling log verbosity across its pods: + +- `NUMAFLOW_LOG_LEVEL` — sets the log level (`debug`, `info`, `warn`, `error`) for Numaflow-owned components. Overrides the level implied by `NUMAFLOW_DEBUG`. +- `RUST_LOG` — advanced override for data-plane pods (vertex `numa` container, MonoVertex `numa` container, serving pods). Accepts standard [`tracing-subscriber` EnvFilter syntax](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html) (e.g. `warn`, `numaflow_core=debug,info`) and takes precedence over `NUMAFLOW_LOG_LEVEL`. +- `NUMAFLOW_DEBUG` — development shortcut; sets level to `debug` and may switch log output from JSON to human-readable text. **Note:** the format change may break log shippers expecting JSON — prefer `NUMAFLOW_LOG_LEVEL` when only the level needs changing. + +See [Log Levels](log-levels.md) for a full pod inventory, per-component YAML examples, and common recipes. In [`udf`](../../user-defined-functions/map/map.md), [`udsink`](../../sinks/user-defined-sinks.md) and [`transformer`](../../sources/transformer/overview.md) containers, there are some preset environment variables that can be used directly. diff --git a/docs/user-guide/reference/configuration/log-levels.md b/docs/user-guide/reference/configuration/log-levels.md new file mode 100644 index 0000000000..14f7a8eeda --- /dev/null +++ b/docs/user-guide/reference/configuration/log-levels.md @@ -0,0 +1,185 @@ +# Log Levels + +Numaflow-owned pods use `NUMAFLOW_LOG_LEVEL` as the standard log-level control. Data-plane pods also support `RUST_LOG` for advanced filtering. + +## Quick reference + +| Pod / container | Standard log-level env var | Default level | +|---|---|---| +| Pipeline daemon | `NUMAFLOW_LOG_LEVEL` | `info` | +| MonoVertex daemon | `NUMAFLOW_LOG_LEVEL` | `info` | +| ISB svc create / delete job | `NUMAFLOW_LOG_LEVEL` | `info` | +| ISB svc validate (init container) | `NUMAFLOW_LOG_LEVEL` | `info` | +| Controller (`numaflow-controller`) | `NUMAFLOW_LOG_LEVEL` | `info` | +| Webhook (`numaflow-webhook`) | `NUMAFLOW_LOG_LEVEL` | `info` | +| UX server (`numaflow-server`) | `NUMAFLOW_LOG_LEVEL` | `info` | +| Pipeline vertex `numa` container | `NUMAFLOW_LOG_LEVEL` | `info` | +| MonoVertex `numa` container | `NUMAFLOW_LOG_LEVEL` | `info` | +| Serving pod | `NUMAFLOW_LOG_LEVEL` | `info` | +| InterStepBufferService (JetStream / Redis) | n/a | n/a | + +`NUMAFLOW_DEBUG=true` is also supported as a development shortcut (see [below](#numaflow_debug-interaction)). + +--- + +## Standard log levels — `NUMAFLOW_LOG_LEVEL` + +Numaflow-owned components read `NUMAFLOW_LOG_LEVEL` at startup. + +**Accepted values:** `debug`, `info`, `warn`, `error`. + +**Default:** `info` + +**Precedence:** `NUMAFLOW_LOG_LEVEL` overrides the level implied by `NUMAFLOW_DEBUG`. Invalid values fall back to the level selected by `NUMAFLOW_DEBUG` or the default. For data-plane pods, `RUST_LOG` takes precedence over `NUMAFLOW_LOG_LEVEL` when set. + +### Pipeline daemon pod + +```yaml +apiVersion: numaflow.numaproj.io/v1alpha1 +kind: Pipeline +spec: + templates: + daemon: + containerTemplate: + env: + - name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +### MonoVertex daemon pod + +```yaml +apiVersion: numaflow.numaproj.io/v1alpha1 +kind: MonoVertex +spec: + daemonTemplate: + containerTemplate: + env: + - name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +### Controller, webhook, and UX server + +These are cluster-level components deployed via the install manifests. Set `NUMAFLOW_LOG_LEVEL` directly on the relevant Deployment: + +```yaml +# numaflow-controller Deployment +env: + - name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +### ISB service jobs + +Init and finalizer jobs for pipeline ISB creation/deletion/validation: + +```yaml +apiVersion: numaflow.numaproj.io/v1alpha1 +kind: Pipeline +spec: + templates: + job: + containerTemplate: + env: + - name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +--- + +## Pipeline, MonoVertex, and Serving pods + +Pipeline vertex pods, MonoVertex pods, and Serving pods use `NUMAFLOW_LOG_LEVEL` for common log-level cases: + +**Accepted values:** `debug`, `info`, `warn`, `error`. + +**Default:** `info` + +### Pipeline vertex pod + +```yaml +apiVersion: numaflow.numaproj.io/v1alpha1 +kind: Pipeline +spec: + vertices: + - name: my-vertex + containerTemplate: + env: + - name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +### MonoVertex pod + +```yaml +apiVersion: numaflow.numaproj.io/v1alpha1 +kind: MonoVertex +spec: + containerTemplate: + env: + - name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +### Serving pod + +```yaml +apiVersion: numaflow.numaproj.io/v1alpha1 +kind: ServingPipeline +spec: + serving: + containerTemplate: + env: + - name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +### Advanced data-plane filtering — `RUST_LOG` + +Data-plane pods also support standard [`tracing-subscriber` EnvFilter syntax](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html) via `RUST_LOG`. Use this only when you need fine-grained filtering. When `RUST_LOG` is set, it takes precedence over `NUMAFLOW_LOG_LEVEL`. + +For example, to enable debug logs for a specific target only: + +```yaml +# Pipeline.spec.vertices[].containerTemplate.env +- name: RUST_LOG + value: "numaflow_core=debug,info" +``` + +--- + +## `NUMAFLOW_DEBUG` interaction + +`NUMAFLOW_DEBUG=true` is a development shortcut. It lowers the default log level to `debug` and may switch log output from structured JSON to human-readable text. + +**Important:** switching from JSON to text format may break log shippers or aggregators that expect structured JSON. Prefer `NUMAFLOW_LOG_LEVEL=debug` to lower the level without changing the output format. + +`NUMAFLOW_LOG_LEVEL` overrides the level implied by `NUMAFLOW_DEBUG` without changing the format selected by `NUMAFLOW_DEBUG`. For data-plane pods, `RUST_LOG` takes precedence over `NUMAFLOW_LOG_LEVEL` when set. + +--- + +## Common recipes + +**Suppress idle-rater info noise on a MonoVertex daemon:** +```yaml +# MonoVertex.spec.daemonTemplate.containerTemplate.env +- name: NUMAFLOW_LOG_LEVEL + value: warn +``` + +**Enable debug for a single data-plane target without flooding all logs:** +```yaml +# Pipeline.spec.vertices[].containerTemplate.env +- name: RUST_LOG + value: "numaflow_core=debug,info" +``` + +**Enable full debug on a vertex pod:** +```yaml +# Pipeline.spec.vertices[].containerTemplate.env +- name: NUMAFLOW_DEBUG + value: "true" +# Note: this also switches log output from JSON to text. +# To keep JSON format while lowering the level, use NUMAFLOW_LOG_LEVEL=debug instead. +``` diff --git a/mkdocs.yml b/mkdocs.yml index 1b4ea37fcf..0815c8c93c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -125,6 +125,7 @@ nav: - user-guide/reference/configuration/container-resources.md - user-guide/reference/configuration/volumes.md - user-guide/reference/configuration/environment-variables.md + - user-guide/reference/configuration/log-levels.md - user-guide/reference/configuration/labels-and-annotations.md - user-guide/reference/configuration/init-containers.md - user-guide/reference/configuration/sidecar-containers.md diff --git a/pkg/apis/numaflow/v1alpha1/const.go b/pkg/apis/numaflow/v1alpha1/const.go index 46aaec973b..73a9d51ef2 100644 --- a/pkg/apis/numaflow/v1alpha1/const.go +++ b/pkg/apis/numaflow/v1alpha1/const.go @@ -129,6 +129,7 @@ const ( EnvLeaderElectionLeaseRenewPeriod = "NUMAFLOW_LEADER_ELECTION_LEASE_RENEW_PERIOD" EnvUDContainerType = "NUMAFLOW_UD_CONTAINER_TYPE" EnvDebug = "NUMAFLOW_DEBUG" + EnvLogLevel = "NUMAFLOW_LOG_LEVEL" EnvPPROF = "NUMAFLOW_PPROF" EnvHealthCheckDisabled = "NUMAFLOW_HEALTH_CHECK_DISABLED" EnvGRPCMaxMessageSize = "NUMAFLOW_GRPC_MAX_MESSAGE_SIZE" diff --git a/pkg/mvtxdaemon/server/service/rater/rater.go b/pkg/mvtxdaemon/server/service/rater/rater.go index 6269d10719..09bf9e73a5 100644 --- a/pkg/mvtxdaemon/server/service/rater/rater.go +++ b/pkg/mvtxdaemon/server/service/rater/rater.go @@ -270,7 +270,11 @@ func (r *Rater) getPodMetrics(podName string) map[string]*dto.MetricFamily { func (r *Rater) getPodReadCounts(podName string, result map[string]*dto.MetricFamily) *PodMetricsCount { value, ok := result[monoVtxReadMetricName] if !ok || value == nil || len(value.GetMetric()) == 0 { - r.log.Infof("[Pod name %s]: Metric %q is unavailable, the pod might haven't started processing data", podName, monoVtxReadMetricName) + // Logged at debug because this fires on every rater tick when the queue is idle (no messages + // ever read since pod start), since the Rust prometheus client only registers a counter after + // it is first incremented. + // To suppress, set NUMAFLOW_LOG_LEVEL=warn on the daemon container. + r.log.Debugf("[Pod name %s]: Metric %q is unavailable, the pod might haven't started processing data", podName, monoVtxReadMetricName) return nil } @@ -293,7 +297,9 @@ func (r *Rater) getPodPendingCounts(podName string, result map[string]*dto.Metri podPendingCount := &PodMetricsCount{podName, metricsList[0].Gauge.GetValue()} return podPendingCount } else { - r.log.Infof("[Pod name %s]: Metric %q is unavailable, the pod might haven't started processing data", podName, monoVtxPendingRawMetric) + // Same rationale as getPodReadCounts: gauge may not be emitted yet on an idle pod. + // To suppress, set NUMAFLOW_LOG_LEVEL=warn on the daemon container. + r.log.Debugf("[Pod name %s]: Metric %q is unavailable, the pod might haven't started processing data", podName, monoVtxPendingRawMetric) } return nil } diff --git a/pkg/shared/logging/log.go b/pkg/shared/logging/log.go index cc1226d20e..5c118c32fa 100644 --- a/pkg/shared/logging/log.go +++ b/pkg/shared/logging/log.go @@ -18,22 +18,41 @@ package logging import ( "context" + "fmt" "os" + "strings" zap "go.uber.org/zap" "go.uber.org/zap/zapcore" ) -// NewLogger returns a new zap.SugaredLogger +const ( + envDebug = "NUMAFLOW_DEBUG" + envLogLevel = "NUMAFLOW_LOG_LEVEL" +) + +// NewLogger returns a new zap.SugaredLogger. +// Log level can be overridden at runtime via the NUMAFLOW_LOG_LEVEL env var +// (accepted values: debug, info, warn, error). +// NUMAFLOW_DEBUG=true selects the development preset (console encoder, debug level). +// NUMAFLOW_LOG_LEVEL overrides the level chosen by NUMAFLOW_DEBUG. func NewLogger() *zap.SugaredLogger { var config zap.Config - debugMode, ok := os.LookupEnv("NUMAFLOW_DEBUG") + debugMode, ok := os.LookupEnv(envDebug) if ok && debugMode == "true" { config = zap.NewDevelopmentConfig() } else { config = zap.NewProductionConfig() } - // Config customization goes here if any + // NUMAFLOW_LOG_LEVEL overrides the level set by the preset above. + // Invalid values fall back to the preset level so a typo does not crash the pod. + if lvlStr, ok := os.LookupEnv(envLogLevel); ok && strings.TrimSpace(lvlStr) != "" { + if lvl, ok := parseLogLevel(lvlStr); ok { + config.Level = zap.NewAtomicLevelAt(lvl) + } else { + _, _ = fmt.Fprintf(os.Stderr, "invalid %s=%q, using default log level\n", envLogLevel, lvlStr) + } + } config.EncoderConfig.EncodeTime = zapcore.RFC3339NanoTimeEncoder config.OutputPaths = []string{"stdout"} logger, err := config.Build() @@ -43,6 +62,21 @@ func NewLogger() *zap.SugaredLogger { return logger.Named("numaflow").Sugar() } +func parseLogLevel(level string) (zapcore.Level, bool) { + switch strings.ToLower(strings.TrimSpace(level)) { + case "debug": + return zapcore.DebugLevel, true + case "info": + return zapcore.InfoLevel, true + case "warn": + return zapcore.WarnLevel, true + case "error": + return zapcore.ErrorLevel, true + default: + return zapcore.InfoLevel, false + } +} + type loggerKey struct{} // WithLogger returns a copy of parent context in which the diff --git a/pkg/shared/logging/log_test.go b/pkg/shared/logging/log_test.go new file mode 100644 index 0000000000..ee84cb5f91 --- /dev/null +++ b/pkg/shared/logging/log_test.go @@ -0,0 +1,100 @@ +/* +Copyright 2022 The Numaproj Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package logging + +import ( + "testing" + + "go.uber.org/zap/zapcore" +) + +func TestNewLogger_DefaultLevel(t *testing.T) { + t.Setenv("NUMAFLOW_DEBUG", "") + t.Setenv("NUMAFLOW_LOG_LEVEL", "") + logger := NewLogger() + if logger == nil { + t.Fatal("expected non-nil logger") + } +} + +func TestNewLogger_LogLevel(t *testing.T) { + tests := []struct { + envVal string + wantLevel zapcore.Level + }{ + {"debug", zapcore.DebugLevel}, + {"info", zapcore.InfoLevel}, + {"warn", zapcore.WarnLevel}, + {"WARN", zapcore.WarnLevel}, + {"error", zapcore.ErrorLevel}, + } + for _, tt := range tests { + t.Run(tt.envVal, func(t *testing.T) { + t.Setenv("NUMAFLOW_DEBUG", "") + t.Setenv("NUMAFLOW_LOG_LEVEL", tt.envVal) + logger := NewLogger() + if logger == nil { + t.Fatal("expected non-nil logger") + } + if !logger.Level().Enabled(tt.wantLevel) { + t.Errorf("expected level %v to be enabled", tt.wantLevel) + } + if tt.wantLevel != zapcore.DebugLevel && logger.Level().Enabled(zapcore.DebugLevel) { + t.Errorf("expected debug level to be disabled for %v", tt.envVal) + } + }) + } +} + +func TestNewLogger_InvalidLogLevel_FallsBackToDefault(t *testing.T) { + t.Setenv("NUMAFLOW_DEBUG", "") + t.Setenv("NUMAFLOW_LOG_LEVEL", "garbage") + logger := NewLogger() + if logger == nil { + t.Fatal("expected non-nil logger") + } + // default production config is info; debug should be disabled + if logger.Level().Enabled(zapcore.DebugLevel) { + t.Error("expected debug to be disabled on invalid NUMAFLOW_LOG_LEVEL") + } +} + +func TestNewLogger_LogLevelOverridesDebugPreset(t *testing.T) { + // NUMAFLOW_DEBUG=true sets level to debug; NUMAFLOW_LOG_LEVEL=warn should win + t.Setenv("NUMAFLOW_DEBUG", "true") + t.Setenv("NUMAFLOW_LOG_LEVEL", "warn") + logger := NewLogger() + if logger == nil { + t.Fatal("expected non-nil logger") + } + if logger.Level().Enabled(zapcore.InfoLevel) { + t.Error("expected info to be disabled when NUMAFLOW_LOG_LEVEL=warn overrides NUMAFLOW_DEBUG=true") + } + if !logger.Level().Enabled(zapcore.WarnLevel) { + t.Error("expected warn to be enabled") + } +} + +func TestParseLogLevelRejectsRuntimeSpecificLevels(t *testing.T) { + for _, level := range []string{"dpanic", "panic", "fatal"} { + t.Run(level, func(t *testing.T) { + if _, ok := parseLogLevel(level); ok { + t.Fatalf("expected %q to be rejected", level) + } + }) + } +} diff --git a/rust/numaflow/src/setup_tracing.rs b/rust/numaflow/src/setup_tracing.rs index 7a19d8ad1f..028bee0e6a 100644 --- a/rust/numaflow/src/setup_tracing.rs +++ b/rust/numaflow/src/setup_tracing.rs @@ -7,6 +7,11 @@ use tracing_subscriber::{Layer, filter::EnvFilter, fmt}; use std::backtrace::{Backtrace, BacktraceStatus}; use std::panic::PanicHookInfo; +const DEFAULT_LOG_LEVEL: &str = "info"; +const DEBUG_LOG_LEVEL: &str = "debug,h2::codec=info"; +const NUMAFLOW_LOG_LEVEL_ENV: &str = "NUMAFLOW_LOG_LEVEL"; +const RUST_LOG_ENV: &str = "RUST_LOG"; + /// Panic hook to send panic info to `tracing` instead of stderr. /// Without this, a panic will be logged to stderr as: /// ``` @@ -196,19 +201,55 @@ impl Drop for TracerProviderGuard { } } +fn parse_numaflow_log_level(level: &str) -> Option<&'static str> { + match level.trim().to_ascii_lowercase().as_str() { + "debug" => Some("debug"), + "info" => Some("info"), + "warn" => Some("warn"), + "error" => Some("error"), + _ => None, + } +} + +fn default_log_directive( + debug_mode: bool, + numaflow_log_level: Option<&str>, + rust_log_set: bool, +) -> &'static str { + if rust_log_set { + return DEFAULT_LOG_LEVEL; + } + + if let Some(level) = numaflow_log_level + && !level.trim().is_empty() + { + if let Some(parsed_level) = parse_numaflow_log_level(level) { + return parsed_level; + } + eprintln!( + "[setup_tracing] Invalid {NUMAFLOW_LOG_LEVEL_ENV}='{level}', using default log level" + ); + } + + if debug_mode { + DEBUG_LOG_LEVEL + } else { + DEFAULT_LOG_LEVEL + } +} + /// Initialize the tracing subscriber with optional OTLP export. /// Returns a `TracerProviderGuard` that will flush buffered spans on drop. /// Callers must bind it (e.g., `let _guard = register();`) rather than /// discard it, or the provider will shut down immediately. pub fn register() -> TracerProviderGuard { let debug_mode = std::env::var("NUMAFLOW_DEBUG").is_ok_and(|v| v.to_lowercase() == "true"); - let default_log_level = if debug_mode { - "debug,h2::codec=info" // "h2::codec" is too noisy - } else { - "info" - }; + let rust_log_set = std::env::var(RUST_LOG_ENV).is_ok_and(|v| !v.trim().is_empty()); + let numaflow_log_level = std::env::var(NUMAFLOW_LOG_LEVEL_ENV).ok(); + let default_log_level = + default_log_directive(debug_mode, numaflow_log_level.as_deref(), rust_log_set); - // Build filtering from default directives and allow `RUST_LOG` environment variable to override. + // Build filtering from Numaflow defaults and allow `RUST_LOG` to override with EnvFilter syntax. let filter = EnvFilter::builder() .with_default_directive(default_log_level.parse().unwrap_or(Level::INFO.into())) .from_env_lossy(); @@ -303,6 +344,44 @@ mod tests { )) } + #[test] + fn default_log_directive_uses_info_by_default() { + assert_eq!(default_log_directive(false, None, false), "info"); + } + + #[test] + fn default_log_directive_uses_debug_mode_default() { + assert_eq!( + default_log_directive(true, None, false), + "debug,h2::codec=info" + ); + } + + #[test] + fn default_log_directive_uses_numaflow_log_level() { + assert_eq!(default_log_directive(false, Some("warn"), false), "warn"); + assert_eq!(default_log_directive(false, Some("ERROR"), false), "error"); + } + + #[test] + fn default_log_directive_numaflow_log_level_overrides_debug() { + assert_eq!(default_log_directive(true, Some("warn"), false), "warn"); + } + + #[test] + fn default_log_directive_invalid_numaflow_log_level_falls_back() { + assert_eq!(default_log_directive(false, Some("verbose"), false), "info"); + assert_eq!( + default_log_directive(true, Some("verbose"), false), + "debug,h2::codec=info" + ); + } + + #[test] + fn default_log_directive_rust_log_takes_precedence() { + assert_eq!(default_log_directive(true, Some("warn"), true), "info"); + } + #[test] fn sampler_always_on() { assert!(matches!(