[Feature Request] Implement OpenTelemetry Metrics in Newt #131

@marcschaeferger

Add OpenTelemetry-based observability to Newt

Reference: fosrl/gerbil#25

Summary / Goal

Instrument Newt with OpenTelemetry Metrics (OTel) following CNCF / industry standards so that:

  • Metrics are emitted using the OpenTelemetry Go SDK (vendor-neutral API).
  • Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
  • Semantic conventions, base units (_seconds, _bytes), and low-cardinality labels are enforced.
  • Focus is metrics first; design should allow adding traces and logs later.
  • Provide an out‑of‑the‑box /metrics endpoint for Prometheus scraping (Prometheus exporter) and example OTel Collector config for production pipelines.

Why OpenTelemetry (OTel)

  • OTel is the CNCF standard for multi‑signal observability (metrics, traces, logs).
  • Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
  • OTel Collector enables enrichment, normalization, batching, and flexible export pipelines (OTLP, remote_write).

Requirements & Constraints

  • Use the OpenTelemetry Go SDK (modules) and follow OTel semantic conventions for relevant signals (HTTP, RPC, network).
  • Provide a /metrics endpoint in Prometheus exposition format via the OTel Prometheus exporter.
  • Report all durations in seconds and all sizes in bytes. Metric names should carry the unit where applicable (_seconds, _bytes), and counters should end in _total.
  • Labels must be low-cardinality and stable (e.g., site_id, tunnel_id, transport).
  • Exporters configurable at runtime through environment variables (no code change required to switch).
  • Provide an example OTel Collector config demonstrating attribute promotion and remote_write.
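
The naming rules above could be checked automatically, for example in a unit test over the metric catalog. The following is a minimal sketch using only the Go standard library; `checkName` and the kind strings ("counter", "gauge", "histogram") are hypothetical names for illustration and are not part of Newt or OpenTelemetry.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validName accepts the snake_case, newt_-prefixed names used in this proposal.
var validName = regexp.MustCompile(`^newt_[a-z0-9_]+$`)

// checkName returns the convention violations found for one metric.
// kind is "counter", "gauge" or "histogram" (labels assumed for this sketch).
func checkName(name, kind string) []string {
	var problems []string
	if !validName.MatchString(name) {
		problems = append(problems, "name must be snake_case with the newt_ prefix")
	}
	hasTotal := strings.HasSuffix(name, "_total")
	if kind == "counter" && !hasTotal {
		problems = append(problems, "counters should end in _total")
	}
	if kind != "counter" && hasTotal {
		problems = append(problems, "_total is reserved for counters")
	}
	// Reject scaled units; only base units (seconds, bytes) are allowed.
	for _, seg := range strings.Split(name, "_") {
		switch seg {
		case "ms", "millis", "milliseconds", "kb", "mb", "kilobytes", "megabytes":
			problems = append(problems, "use base units (_seconds, _bytes), not "+seg)
		}
	}
	return problems
}

func main() {
	fmt.Println(checkName("newt_tunnel_bytes_total", "counter"))  // prints []
	fmt.Println(checkName("newt_tunnel_latency_ms", "histogram")) // one violation
	fmt.Println(checkName("newt_tunnel_sessions_total", "gauge")) // one violation
}
```

Applied to the Recommended Newt Metrics table, a check like this would flag `newt_tunnel_sessions_total`, which is typed as a Gauge but carries the counter suffix.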

Recommended Newt Metrics

| Category | Metric Name | Type | Labels | Units / Notes |
|---|---|---|---|---|
| Site / Registration | newt_site_registrations_total | Counter | site_id, region, result | count |
| | newt_site_online | Gauge | site_id, transport | bool (0/1) |
| | newt_site_last_heartbeat_seconds | Gauge | site_id | seconds since last heartbeat |
| Tunnel / Sessions | newt_tunnel_sessions_total | Gauge | site_id, tunnel_id, transport | active sessions |
| | newt_tunnel_bytes_total | Counter | site_id, tunnel_id, direction | bytes (in/out) |
| | newt_tunnel_latency_seconds | Histogram | site_id, tunnel_id, transport | seconds |
| | newt_tunnel_reconnects_total | Counter | site_id, tunnel_id, reason | count |
| Connection / NAT | newt_connection_attempts_total | Counter | site_id, transport, result | count |
| | newt_connection_errors_total | Counter | site_id, transport, error_type | count |
| | newt_nat_mapping_active | Gauge | site_id, mapping_type | bool/count |
| Peer / Health | newt_peer_heartbeat_latency_seconds | Histogram | site_id, peer_id | seconds |
| | newt_peer_last_handshake_seconds | Gauge | site_id, peer_id | seconds |
| Operational / Ops | newt_config_reloads_total | Counter | result | count |
| | newt_restart_count_total | Counter | | count |
| Runtime | newt_go_goroutines | Gauge | | count |
| | newt_go_mem_alloc_bytes | Gauge | | bytes |

Implementation Plan

  1. Dependencies (example packages)

    • Add OpenTelemetry Go modules to go.mod:
      • go.opentelemetry.io/otel
      • go.opentelemetry.io/otel/sdk/metric
      • go.opentelemetry.io/otel/exporters/prometheus
      • go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc (or OTLP HTTP variant)
      • Optional contrib instrumentation:
        • go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
        • go.opentelemetry.io/contrib/instrumentation/runtime
      • ...
  2. Central metrics package

    • Create internal/metrics/ that:
      • Initializes OTel MeterProvider.
      • Registers Prometheus exporter (when enabled) and exposes a handler on /metrics (or mounts to existing server route).
      • Optionally registers OTLP exporter when enabled via env vars.
      • Pre-registers all Newt metric instruments with names, descriptions and label keys.
      • Exposes a singleton metrics API with helper functions:
        • Inc(name string, labels ...attribute.KeyValue)
        • Observe(name string, value float64, labels ...attribute.KeyValue)
        • SetGauge(name string, value float64, labels ...attribute.KeyValue)
      • Implements Shutdown(ctx) to flush and stop providers/exporters.
  3. Instrumentation approach

    • Site registration & heartbeats:
      • Increment registration counters and set site_online/site_last_heartbeat.
    • Tunnels & sessions:
      • Update session counts, bytes in/out, latency histograms, reconnect counters.
    • Connection & NAT logic:
      • Record connection attempts, successes/failures, NAT mapping states.
    • Peer health & handshakes:
      • Observe heartbeat latency and last handshake timestamps.
    • Operational flows:
      • Config reloads and restarts.
    • Runtime metrics:
      • Register basic Go runtime metrics (goroutines, mem) via contrib or runtime package and export them.
  4. Histograms & buckets

    • Duration buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
    • Byte-size buckets: [512, 1024, 4096, 16384, 65536, 262144, 1048576]
    • Always use seconds for durations and bytes for sizes.
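
To sanity-check a bucket layout, it can help to see which bucket a given observation lands in. The sketch below uses only the standard library and the duration boundaries listed above; `bucketFor` is a hypothetical helper. It relies on the fact that explicit-bucket histogram boundaries are inclusive upper bounds (the `le` label in Prometheus).

```go
package main

import (
	"fmt"
	"sort"
)

// durationBuckets are the duration boundaries proposed in this issue, in seconds.
var durationBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// bucketFor returns the upper bound of the first bucket containing v,
// or "+Inf" when v exceeds every boundary.
func bucketFor(bounds []float64, v float64) string {
	i := sort.SearchFloat64s(bounds, v) // smallest i with bounds[i] >= v
	if i == len(bounds) {
		return "+Inf"
	}
	return fmt.Sprint(bounds[i])
}

func main() {
	for _, v := range []float64{0.003, 0.03, 1, 45} {
		fmt.Printf("%gs -> le=%s\n", v, bucketFor(durationBuckets, v))
	}
	// Output:
	// 0.003s -> le=0.005
	// 0.03s -> le=0.05
	// 1s -> le=1
	// 45s -> le=+Inf
}
```

If most real observations end up in the first or the `+Inf` bucket, the boundaries are a poor fit for that metric and should be revisited.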
  5. Exporter configuration (runtime)

    • Environment variables (suggested defaults):
      • NEWT_METRICS_PROMETHEUS_ENABLED=true
      • NEWT_METRICS_OTLP_ENABLED=false
      • OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
      • OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
      • OTEL_SERVICE_NAME=newt
      • OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
      • OTEL_METRIC_EXPORT_INTERVAL (ms)
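
A minimal sketch of reading the two Newt-specific switches with the suggested defaults follows. It uses only the standard library; `exporterConfig`, `boolEnv` and `loadExporterConfig` are hypothetical names. The `OTEL_*` variables are deliberately not parsed here, on the assumption that the OTLP exporters and the SDK's periodic reader pick them up themselves.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// exporterConfig holds the Newt-specific exporter switches (hypothetical type).
type exporterConfig struct {
	PrometheusEnabled bool
	OTLPEnabled       bool
}

// boolEnv parses a boolean variable and falls back to def when the
// variable is unset or not a valid boolean.
func boolEnv(getenv func(string) string, key string, def bool) bool {
	v, err := strconv.ParseBool(getenv(key))
	if err != nil {
		return def
	}
	return v
}

// loadExporterConfig applies the suggested defaults: Prometheus on, OTLP off.
// getenv is injected so the function can be tested without touching the process environment.
func loadExporterConfig(getenv func(string) string) exporterConfig {
	return exporterConfig{
		PrometheusEnabled: boolEnv(getenv, "NEWT_METRICS_PROMETHEUS_ENABLED", true),
		OTLPEnabled:       boolEnv(getenv, "NEWT_METRICS_OTLP_ENABLED", false),
	}
}

func main() {
	cfg := loadExporterConfig(os.Getenv)
	fmt.Printf("%+v\n", cfg)
}
```

Falling back to the default on an unparsable value is one possible policy; failing fast at startup would be a stricter alternative.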
  6. Local testing

    • Provide docker-compose.metrics.yml with:
      • Newt (local build)
      • OpenTelemetry Collector (example config)
      • Prometheus (scraping /metrics or scraping Collector)
      • Grafana (optional)
    • Validate direct Prometheus scrape and OTLP → Collector → remote_write flows.
  7. Collector example

    • Include examples/collector.yaml demonstrating:
      • OTLP receiver
      • Transform processor to promote resource attributes (e.g., wg_interface, peer, site_id)
      • Prometheus remote_write exporter (generic endpoint)
      • Notes on:
        • Metric name normalization for Prometheus
        • out_of_order_time_window if sending OTLP to Prometheus
  8. Documentation

    • docs/observability.md:
      • Metric catalog (name, type, labels, units, description)
      • How to enable/disable Prometheus exporter and OTLP exporter via env vars
      • How to run Docker Compose test stack
      • How to add a new metric (naming, labels, buckets)
  9. Testing & validation

    • Manual test: start the compose stack, generate traffic, curl /metrics, and verify metric names, units, labels, and histogram buckets.
    • Include sample /metrics output in the PR.
    • ...
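
Part of the manual check could be scripted. The sketch below extracts metric names and types from the `# TYPE` lines of Prometheus text exposition output and reports expected metrics that are absent or mistyped. It uses only the standard library; `metricTypes` and `missing` are hypothetical helpers, and the sample text is hand-written for illustration, not real Newt output.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// sample is hand-written exposition text for this sketch, not real Newt output.
const sample = `# TYPE newt_site_online gauge
newt_site_online{site_id="site-1",transport="udp"} 1
# TYPE newt_tunnel_bytes_total counter
newt_tunnel_bytes_total{direction="in",site_id="site-1",tunnel_id="t1"} 2048
`

// metricTypes collects name -> type from "# TYPE <name> <type>" lines.
func metricTypes(exposition string) map[string]string {
	types := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) == 4 && f[0] == "#" && f[1] == "TYPE" {
			types[f[2]] = f[3]
		}
	}
	return types
}

// missing returns the sorted names in want that are absent from got
// or present with a different type.
func missing(got, want map[string]string) []string {
	var out []string
	for name, typ := range want {
		if got[name] != typ {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	want := map[string]string{
		"newt_site_online":            "gauge",
		"newt_tunnel_bytes_total":     "counter",
		"newt_tunnel_latency_seconds": "histogram",
	}
	fmt.Println(missing(metricTypes(sample), want)) // prints [newt_tunnel_latency_seconds]
}
```

In a real check the input would come from an HTTP GET against /metrics, and the expected map would be generated from the metric catalog.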

Acceptance Criteria

  • /metrics endpoint exposes OTel metrics in Prometheus format with correct naming and units.
  • Newt metrics cover site registration/heartbeats, tunnel sessions/throughput/latency, connections/NAT, peer health, and operational events.
  • Exporter backends can be swapped via environment variables without code changes.
  • Example OTel Collector config provided and tested in local compose flow.
  • docs/observability.md added with metric catalog and run instructions.
