Monitoring

Purpose

This page describes how Get2Dial services expose health and metrics endpoints, and which signals to watch.

Overview

The control plane exposes three unauthenticated endpoints:

GET /healthz — liveness.
GET /readyz — readiness; aggregates the Postgres, Redis and NATS health checkers registered at boot.
GET /metrics — Prometheus metrics (the middleware chain is CORS → request-logger with trace ID → metrics instrumentation → mux).
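
As an illustration of what "aggregates the health checkers" can mean for /readyz, here is a minimal Python sketch. It is not the control plane's implementation: the checker signature, the 200/503 status codes and the report shape are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical checker type: returns None when healthy, or an error message.
Checker = Callable[[], Optional[str]]


def aggregate_readiness(checkers: Dict[str, Checker]) -> Tuple[int, Dict[str, str]]:
    """Run every registered checker and return (HTTP status, per-dependency report).

    Assumption: ready means all checkers pass, reported as 200, otherwise 503.
    """
    report: Dict[str, str] = {}
    for name, check in checkers.items():
        try:
            error = check()
        except Exception as exc:  # a checker that raises counts as unhealthy
            error = str(exc) or exc.__class__.__name__
        report[name] = "ok" if error is None else f"error: {error}"
    healthy = all(state == "ok" for state in report.values())
    return (200 if healthy else 503), report
```

With checkers for postgres, redis and nats registered, a single failing dependency is enough to flip the aggregate result to not ready, which is why a failing /readyz points at a dependency rather than at the process itself.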

GET /api/v1/edge/health is a separate public aggregator. It probes each component of the edge stack (control, callengine, nodeagent, OpenSIPS, rtpengine, FreeSWITCH) at the URLs given by the EDGE_HEALTH_* settings.
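
To reproduce the edge probe by hand, something like the following sketch may help. It assumes the EDGE_HEALTH_* values are environment variables that each hold one probe URL, and that the suffix names the component; the concrete variable names (for example EDGE_HEALTH_CONTROL) are hypothetical.

```python
import os
import urllib.error
import urllib.request
from typing import Dict, Mapping

PREFIX = "EDGE_HEALTH_"


def edge_health_targets(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Map component name -> probe URL, taken from non-empty EDGE_HEALTH_* variables."""
    return {
        key[len(PREFIX):].lower(): url
        for key, url in sorted(env.items())
        if key.startswith(PREFIX) and url
    }


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP errors and malformed URLs all count as down.
        return False
```

Running probe over each value of edge_health_targets() gives a per-component up/down view that can be compared with the aggregator's own answer.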

Configuration

Point container health checks at the endpoints above, for example:

healthcheck:
  test: ["CMD", "wget", "--spider", "--quiet", "http://localhost:8080/healthz"]
  interval: 10s
  timeout: 5s
  retries: 5

Logging is controlled by two settings: LOG_LEVEL (default: info) and LOG_FORMAT (default: json). Set them only when you need to override the defaults.
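
For tooling that should follow the same logging settings (a probe script, for instance), the defaults can be mirrored as in this sketch. The set of accepted level names is an assumption; only the variable names and their defaults come from this page.

```python
import logging
import os
from typing import Mapping, Tuple

# Assumed level names; the values the services actually accept may differ.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warn": logging.WARNING,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def logging_settings(env: Mapping[str, str] = os.environ) -> Tuple[int, str]:
    """Return (numeric log level, format name), falling back to info and json."""
    level_name = env.get("LOG_LEVEL", "info").strip().lower()
    log_format = env.get("LOG_FORMAT", "json").strip().lower() or "json"
    # Unknown or empty level names fall back to the documented default, info.
    return LEVELS.get(level_name, logging.INFO), log_format
```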

Examples

Key signals to alert on:

/readyz failing — a dependency (Postgres/Redis/NATS) is down.
Dialer pacing stalls (no originates while leads remain) — often a NODE_ID mismatch between pacer target and edge subscription.
NATS disconnects between control plane and an edge.
Abandonment rate approaching a campaign’s abandonment_rate_cap.
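
The abandonment signal above can be turned into a graded alert. The sketch below is one possible shape, under two assumptions that this page does not confirm: the rate is abandoned calls divided by answered calls, and abandonment_rate_cap is a fraction (0.03 meaning 3%). The warn_ratio parameter is hypothetical.

```python
def abandonment_rate(abandoned: int, answered: int) -> float:
    """Abandoned calls as a fraction of answered calls; 0.0 when nothing was answered."""
    if answered <= 0:
        return 0.0
    return abandoned / answered


def abandonment_alert(abandoned: int, answered: int, cap: float,
                      warn_ratio: float = 0.8) -> str:
    """Classify the current rate against the campaign cap.

    Returns "critical" at or above the cap, "warning" once the rate reaches
    warn_ratio of the cap, and "ok" otherwise.
    """
    rate = abandonment_rate(abandoned, answered)
    if rate >= cap:
        return "critical"
    if rate >= cap * warn_ratio:
        return "warning"
    return "ok"
```

Warning before the cap is reached leaves time to slow pacing, rather than finding out only after the cap has been exceeded.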

Notes

A trace ID is propagated end to end so a single call can be followed across services.
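After an upgrade, check the applied migration version and treat it as a first-class health signal: a version other than the one the release expects should be handled like a failing health check.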
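
For the trace ID note above, propagation between services usually comes down to reusing the inbound ID on every outbound call. The sketch below shows that pattern in Python; the header name X-Trace-Id and the ID format are assumptions, since this page does not say how Get2Dial carries the trace ID.

```python
import uuid
from typing import Dict, Mapping, Optional

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, not confirmed by this page


def ensure_trace_id(incoming: Mapping[str, str]) -> str:
    """Reuse the caller's trace ID if present (case-insensitive), else mint a new one."""
    for key, value in incoming.items():
        if key.lower() == TRACE_HEADER.lower() and value.strip():
            return value.strip()
    return uuid.uuid4().hex


def outbound_headers(trace_id: str,
                     extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call that carry the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers
```

Searching the JSON logs of each service for the same trace ID then yields the path of one call across the stack.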