Add OpenTelemetry-based observability to Newt
Reference: fosrl/gerbil#25
Summary / Goal
Instrument Newt with OpenTelemetry Metrics (OTel) following CNCF / industry standards so that:
- Metrics are emitted using the OpenTelemetry Go SDK (vendor-neutral API).
- Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
- Semantic conventions, SI units (_seconds, _bytes), and low‑cardinality labels are enforced.
- Focus is metrics first; design should allow adding traces and logs later.
- Provide an out‑of‑the‑box /metrics endpoint for Prometheus scraping (Prometheus exporter) and example OTel Collector config for production pipelines.
Why OpenTelemetry (OTel)
- OTel is the CNCF standard for multi‑signal observability (metrics, traces, logs).
- Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
- OTel Collector enables enrichment, normalization, batching, and flexible export pipelines (OTLP, remote_write).
Requirements & Constraints
- Use the OpenTelemetry Go SDK (modules) and follow OTel semantic conventions for relevant signals (HTTP, RPC, network).
- Provide a /metrics endpoint in Prometheus exposition format via the OTel Prometheus exporter.
- All durations in seconds and sizes in bytes. Metric names should carry units where applicable (_seconds, _bytes) and counters use _total.
- Labels must be low-cardinality and stable (e.g., `site_id`, `tunnel_id`, `transport`).
- Exporters configurable at runtime through environment variables (no code change required to switch).
- Provide an example OTel Collector config demonstrating attribute promotion and remote_write.
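The naming rules above could be checked mechanically, for example in a unit test over the metric catalog. The following is a minimal sketch using only the standard library; the `Kind` type, the `CheckName` function and the list of rejected unit suffixes are hypothetical and not part of Newt or the OTel SDK.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind is a hypothetical stand-in for the instrument type of a metric.
type Kind int

const (
	Counter Kind = iota
	Gauge
	Histogram
)

// nonBaseUnits lists suffixes that suggest a non-base unit (assumed list, extend as needed).
var nonBaseUnits = []string{"_ms", "_millis", "_milliseconds", "_kb", "_mb"}

// CheckName returns an error when name breaks the conventions:
// counters end in _total, and durations/sizes use _seconds/_bytes.
func CheckName(name string, kind Kind) error {
	if kind == Counter && !strings.HasSuffix(name, "_total") {
		return fmt.Errorf("%s: counters must end in _total", name)
	}
	for _, bad := range nonBaseUnits {
		if strings.HasSuffix(name, bad) || strings.Contains(name, bad+"_") {
			return fmt.Errorf("%s: use _seconds or _bytes instead of %s", name, bad)
		}
	}
	return nil
}

func main() {
	fmt.Println(CheckName("newt_tunnel_bytes_total", Counter))  // accepted
	fmt.Println(CheckName("newt_tunnel_latency_ms", Histogram)) // rejected: non-base unit
	fmt.Println(CheckName("newt_config_reloads", Counter))      // rejected: missing _total
}
```

A check like this only covers naming; label cardinality still has to be reviewed per metric.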
Recommended Newt Metrics
| Category | Metric Name | Type | Labels | Units / Notes |
|---|---|---|---|---|
| Site / Registration | newt_site_registrations_total | Counter | site_id, region, result | count |
| | newt_site_online | Gauge | site_id, transport | bool (0/1) |
| | newt_site_last_heartbeat_seconds | Gauge | site_id | seconds since last heartbeat |
| Tunnel / Sessions | newt_tunnel_sessions_total | Gauge | site_id, tunnel_id, transport | active sessions |
| | newt_tunnel_bytes_total | Counter | site_id, tunnel_id, direction | bytes (in/out) |
| | newt_tunnel_latency_seconds | Histogram | site_id, tunnel_id, transport | seconds |
| | newt_tunnel_reconnects_total | Counter | site_id, tunnel_id, reason | count |
| Connection / NAT | newt_connection_attempts_total | Counter | site_id, transport, result | count |
| | newt_connection_errors_total | Counter | site_id, transport, error_type | count |
| | newt_nat_mapping_active | Gauge | site_id, mapping_type | bool/count |
| Peer / Health | newt_peer_heartbeat_latency_seconds | Histogram | site_id, peer_id | seconds |
| | newt_peer_last_handshake_seconds | Gauge | site_id, peer_id | seconds |
| Operational / Ops | newt_config_reloads_total | Counter | result | count |
| | newt_restart_count_total | Counter | | count |
| Runtime | newt_go_goroutines | Gauge | | count |
| | newt_go_mem_alloc_bytes | Gauge | | bytes |
Implementation Plan
- Dependencies (example packages)
  - Add OpenTelemetry Go modules to `go.mod`:
    - `go.opentelemetry.io/otel`
    - `go.opentelemetry.io/otel/sdk/metric`
    - `go.opentelemetry.io/otel/exporters/prometheus`
    - `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc` (or OTLP HTTP variant)
  - Optional contrib instrumentation:
    - `go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp`
    - `go.opentelemetry.io/contrib/instrumentation/runtime`
    - ...
- Central metrics package
  - Create `internal/metrics/` that:
    - Initializes OTel `MeterProvider`.
    - Registers Prometheus exporter (when enabled) and exposes a handler on `/metrics` (or mounts to existing server route).
    - Optionally registers OTLP exporter when enabled via env vars.
    - Pre-registers all Newt metric instruments with names, descriptions and label keys.
    - Exposes a singleton `metrics` API with helper functions:
      - `Inc(name string, labels ...attribute.KeyValue)`
      - `Observe(name string, value float64, labels ...attribute.KeyValue)`
      - `SetGauge(name string, value float64, labels ...attribute.KeyValue)`
    - Implements `Shutdown(ctx)` to flush and stop providers/exporters.
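To make the shape of the helper API more concrete, here is a rough standard-library-only sketch of a facade with pre-registered metric names. It is not the proposed implementation: `Label` is a hypothetical stand-in for `attribute.KeyValue`, `Registry` and `Value` are invented for illustration, values are kept in memory instead of being forwarded to OTel instruments, and `Observe` is left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Label is a hypothetical stand-in for attribute.KeyValue so the sketch
// compiles without the OTel modules.
type Label struct{ Key, Value string }

// Registry holds the pre-registered metric names and one value per series.
type Registry struct {
	mu     sync.Mutex
	known  map[string]bool
	values map[string]float64
}

// NewRegistry pre-registers the given metric names.
func NewRegistry(names ...string) *Registry {
	r := &Registry{known: map[string]bool{}, values: map[string]float64{}}
	for _, n := range names {
		r.known[n] = true
	}
	return r
}

// seriesKey builds a stable key from the metric name and its sorted labels.
func seriesKey(name string, labels []Label) string {
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		parts = append(parts, l.Key+"="+l.Value)
	}
	sort.Strings(parts)
	return name + "{" + strings.Join(parts, ",") + "}"
}

// update applies f to the series value, rejecting names that were not pre-registered.
func (r *Registry) update(name string, labels []Label, f func(old float64) float64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.known[name] {
		return fmt.Errorf("metric %q is not registered", name)
	}
	k := seriesKey(name, labels)
	r.values[k] = f(r.values[k])
	return nil
}

// Inc adds 1 to a counter series.
func (r *Registry) Inc(name string, labels ...Label) error {
	return r.update(name, labels, func(old float64) float64 { return old + 1 })
}

// SetGauge overwrites a gauge series with value.
func (r *Registry) SetGauge(name string, value float64, labels ...Label) error {
	return r.update(name, labels, func(float64) float64 { return value })
}

// Value returns the current value of a series (0 if it was never written).
func (r *Registry) Value(name string, labels ...Label) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.values[seriesKey(name, labels)]
}

func main() {
	m := NewRegistry("newt_site_registrations_total", "newt_site_online")
	site := Label{Key: "site_id", Value: "site-a"}
	_ = m.Inc("newt_site_registrations_total", site)
	_ = m.SetGauge("newt_site_online", 1, site)
	fmt.Println(m.Value("newt_site_registrations_total", site), m.Value("newt_site_online", site))
	fmt.Println(m.Inc("newt_typo_total")) // unknown names are rejected
}
```

Rejecting unregistered names keeps the metric catalog and the code in sync; a real package would presumably hold OTel instruments in the map instead of raw values.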
- Instrumentation approach
  - Site registration & heartbeats:
    - Increment registration counters and set `site_online` / `site_last_heartbeat`.
  - Tunnels & sessions:
    - Update session counts, bytes in/out, latency histograms, reconnect counters.
  - Connection & NAT logic:
    - Record connection attempts, successes/failures, NAT mapping states.
  - Peer health & handshakes:
    - Observe heartbeat latency and last handshake timestamps.
  - Operational flows:
    - Config reloads and restarts.
  - Runtime metrics:
    - Register basic Go runtime metrics (goroutines, mem) via contrib or runtime package and export them.
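When recording connection failures, the `error_type` label needs a small, fixed set of values; raw error strings would produce unbounded cardinality. One possible way to do that is sketched below. The function name and the label values (`none`, `timeout`, `canceled`, `other`) are assumptions for illustration, not an agreed taxonomy.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// demoErrors are sample inputs used by main below.
var demoErrors = []error{
	nil,
	context.DeadlineExceeded,
	context.Canceled,
	errors.New("handshake failed"),
}

// errorType maps an arbitrary error onto a bounded set of label values
// so that error_type stays low-cardinality.
func errorType(err error) string {
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, context.Canceled):
		return "canceled"
	}
	// Network errors that report a timeout are folded into the same value.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "timeout"
	}
	return "other"
}

func main() {
	for _, err := range demoErrors {
		fmt.Printf("%v -> error_type=%s\n", err, errorType(err))
	}
}
```

The same pattern would apply to the `reason` label on reconnect counters.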
- Histograms & buckets
  - Duration buckets: `[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]`
  - Byte-size buckets: `[512, 1024, 4096, 16384, 65536, 262144, 1048576]`
  - Always use seconds for durations and bytes for sizes.
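To illustrate how observations fall into the duration buckets, the sketch below reproduces Prometheus-style cumulative `le` buckets (upper bound inclusive, plus an implicit `+Inf` bucket) with the standard library only. The function names are hypothetical; in practice the OTel SDK does this bookkeeping once the boundaries are configured on the histogram.

```go
package main

import (
	"fmt"
	"sort"
)

// durationBuckets are the upper bounds in seconds proposed in the issue.
var durationBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// bucketIndex returns the index of the first bound >= v.
// A result of len(bounds) means the observation lands in the +Inf bucket.
func bucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

// cumulativeCounts returns len(bounds)+1 cumulative counts, matching the
// semantics of Prometheus "le" buckets (the last entry is +Inf).
func cumulativeCounts(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, v := range observations {
		counts[bucketIndex(bounds, v)]++
	}
	for i := 1; i < len(counts); i++ {
		counts[i] += counts[i-1]
	}
	return counts
}

func main() {
	// Illustrative latencies in seconds (made-up values).
	obs := []float64{0.003, 0.03, 0.05, 0.7, 45}
	for i, c := range cumulativeCounts(durationBuckets, obs) {
		le := "+Inf"
		if i < len(durationBuckets) {
			le = fmt.Sprint(durationBuckets[i])
		}
		fmt.Printf("le=%s count=%d\n", le, c)
	}
}
```

Observations above 30 seconds are only visible in the `+Inf` bucket, which is worth keeping in mind when choosing the largest boundary.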
- Exporter configuration (runtime)
  - Environment variables (suggested defaults):
    - `NEWT_METRICS_PROMETHEUS_ENABLED=true`
    - `NEWT_METRICS_OTLP_ENABLED=false`
    - `OTEL_EXPORTER_OTLP_ENDPOINT` (when OTLP enabled)
    - `OTEL_EXPORTER_OTLP_PROTOCOL` (`http/protobuf` or `grpc`)
    - `OTEL_SERVICE_NAME=newt`
    - `OTEL_RESOURCE_ATTRIBUTES` (e.g., `service.instance.id=...`)
    - `OTEL_METRIC_EXPORT_INTERVAL` (ms)
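Reading these variables could look roughly like the sketch below. The `Config` struct, `LoadConfig` and `boolVar` are hypothetical names; the defaults for the two `NEWT_*` switches and the service name follow the list above, and the 60000 ms fallback for the export interval is the default documented in the OpenTelemetry SDK environment variable specification. Note that the OTel SDK and OTLP exporters already read the `OTEL_*` variables themselves, so an implementation may only need to handle the `NEWT_*` ones.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config is a hypothetical settings struct; the field names are illustrative.
type Config struct {
	PrometheusEnabled bool
	OTLPEnabled       bool
	OTLPEndpoint      string
	OTLPProtocol      string // empty when unset
	ServiceName       string
	ExportInterval    time.Duration
}

// boolVar parses a boolean variable, returning def when it is unset.
func boolVar(lookup func(string) (string, bool), key string, def bool) (bool, error) {
	v, ok := lookup(key)
	if !ok {
		return def, nil
	}
	b, err := strconv.ParseBool(v)
	if err != nil {
		return def, fmt.Errorf("%s: %w", key, err)
	}
	return b, nil
}

// LoadConfig reads exporter settings through lookup, which has the same
// signature as os.LookupEnv so tests can pass a map-backed function.
func LoadConfig(lookup func(string) (string, bool)) (Config, error) {
	cfg := Config{PrometheusEnabled: true, ServiceName: "newt", ExportInterval: 60 * time.Second}
	var err error
	if cfg.PrometheusEnabled, err = boolVar(lookup, "NEWT_METRICS_PROMETHEUS_ENABLED", true); err != nil {
		return cfg, err
	}
	if cfg.OTLPEnabled, err = boolVar(lookup, "NEWT_METRICS_OTLP_ENABLED", false); err != nil {
		return cfg, err
	}
	if v, ok := lookup("OTEL_EXPORTER_OTLP_ENDPOINT"); ok {
		cfg.OTLPEndpoint = v
	}
	if v, ok := lookup("OTEL_EXPORTER_OTLP_PROTOCOL"); ok {
		if v != "grpc" && v != "http/protobuf" {
			return cfg, fmt.Errorf("OTEL_EXPORTER_OTLP_PROTOCOL: unsupported value %q", v)
		}
		cfg.OTLPProtocol = v
	}
	if v, ok := lookup("OTEL_SERVICE_NAME"); ok && v != "" {
		cfg.ServiceName = v
	}
	if v, ok := lookup("OTEL_METRIC_EXPORT_INTERVAL"); ok {
		ms, err := strconv.Atoi(v)
		if err != nil || ms <= 0 {
			return cfg, fmt.Errorf("OTEL_METRIC_EXPORT_INTERVAL: want positive milliseconds, got %q", v)
		}
		cfg.ExportInterval = time.Duration(ms) * time.Millisecond
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "metrics config:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the parsing testable without mutating the process environment.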
- Local testing
  - Provide `docker-compose.metrics.yml` with:
    - Newt (local build)
    - OpenTelemetry Collector (example config)
    - Prometheus (scraping `/metrics` or scraping Collector)
    - Grafana (optional)
  - Validate direct Prometheus scrape and OTLP → Collector → remote_write flows.
- Collector example
  - Include `examples/collector.yaml` demonstrating:
    - OTLP receiver
    - Transform processor to promote resource attributes (e.g., `wg_interface`, `peer`, `site_id`)
    - Prometheus remote_write exporter (generic endpoint)
  - Notes on:
    - Metric name normalization for Prometheus
    - `out_of_order_time_window` if sending OTLP to Prometheus
- Documentation
  - `observability.md`:
    - Metric catalog (name, type, labels, units, description)
    - How to enable/disable Prometheus exporter and OTLP exporter via env vars
    - How to run Docker Compose test stack
    - How to add a new metric (naming, labels, buckets)
- Testing & validation
  - Manual test: start compose, generate traffic, curl `/metrics`, verify metric names, units, labels and histogram buckets.
  - Include sample `/metrics` output in the PR.
  - ...
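Part of the manual check could be automated by scanning the scraped `/metrics` text for the expected sample names. The sketch below does this with the standard library; `metricNames`, `missing` and the sample exposition text are made up for illustration and do not show real Newt output. Histograms appear in the exposition format as `_bucket`, `_sum` and `_count` series, so those are the names to look for.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleExposition is hand-written illustrative input, not real Newt output.
const sampleExposition = `# HELP newt_site_registrations_total Site registrations.
# TYPE newt_site_registrations_total counter
newt_site_registrations_total{site_id="site-a",region="eu",result="success"} 3
# TYPE newt_tunnel_latency_seconds histogram
newt_tunnel_latency_seconds_bucket{site_id="site-a",le="0.005"} 1
newt_tunnel_latency_seconds_bucket{site_id="site-a",le="+Inf"} 2
newt_tunnel_latency_seconds_sum{site_id="site-a"} 0.042
newt_tunnel_latency_seconds_count{site_id="site-a"} 2
`

// metricNames collects the sample names found in Prometheus text exposition,
// skipping comment lines (# HELP / # TYPE) and blank lines.
func metricNames(exposition string) map[string]bool {
	names := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end <= 0 {
			continue
		}
		names[line[:end]] = true
	}
	return names
}

// missing returns the required names that do not appear in the exposition.
func missing(exposition string, required []string) []string {
	found := metricNames(exposition)
	var out []string
	for _, name := range required {
		if !found[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	required := []string{
		"newt_site_registrations_total",
		"newt_tunnel_latency_seconds_bucket",
		"newt_site_online",
	}
	fmt.Println("missing:", missing(sampleExposition, required))
}
```

In a test stack, the input would come from an HTTP GET against the `/metrics` endpoint instead of a constant.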
Acceptance Criteria
/metrics
endpoint exposes OTel metrics in Prometheus format with correct naming and units.- Newt metrics cover site registration/heartbeats, tunnel sessions/throughput/latency, connections/NAT, peer health, certificates and operational events.
- Exporter backends can be swapped via environment variables without code changes.
- Example OTel Collector config provided and tested in local compose flow.
docs/observability.md
added with metric catalog and run instructions.
🔗 References & Best Practices
- Traefik - Metrics (observability) -- Traefik metrics configuration and exporter options.
- OpenTelemetry - Go: Getting Started / Instrumentation Guide -- How to instrument Go applications with OpenTelemetry.
- OpenTelemetry - Go: Exporters -- Exporter options for Go (OTLP, Prometheus, etc.).
Guides & integrations
- Prometheus - OpenTelemetry guide -- Guidance for integrating Prometheus with OpenTelemetry.
- Prometheus blog - Commitment to OpenTelemetry (Mar 2024) -- Prometheus project notes and recommended OTLP ingestion patterns.
Practical walkthroughs & blog posts
- OpenTelemetry blog - Prometheus + OpenTelemetry (2024) -- Practical notes on combining Prometheus and OpenTelemetry.
- Grafana Blog - A practical guide to data collection with OpenTelemetry and Prometheus (Jul 2023) -- Hands-on examples and best practices for OTEL + Prometheus.
- BetterStack - OpenTelemetry for Go -- Practical guide for instrumenting Go apps with OpenTelemetry.
- BetterStack - OpenTelemetry metrics vs Prometheus metrics -- Comparison and guidance on when to use OTEL vs Prometheus metrics.