Monitoring

This page describes how Get2Dial services expose health and metrics, and which signals to watch.

The control plane exposes three unauthenticated endpoints:

  • GET /healthz — liveness.
  • GET /readyz — readiness; aggregates the Postgres, Redis and NATS health checkers registered at boot.
  • GET /metrics — Prometheus metrics (the middleware chain is CORS → request-logger with trace ID → metrics instrumentation → mux).
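As an illustration of what "aggregates" means for /readyz, here is a minimal sketch of a readiness aggregator. It is not the Get2Dial implementation: the `Checker` type, the `readiness` function and the 200/503 status mapping are assumptions made for the example.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical checker contract: return None when healthy, else an error message.
Checker = Callable[[], Optional[str]]


def readiness(checkers: Dict[str, Checker]) -> Tuple[int, Dict[str, str]]:
    """Run every registered checker and return (HTTP status, per-dependency detail)."""
    detail: Dict[str, str] = {}
    healthy = True
    for name, check in checkers.items():
        try:
            err = check()
        except Exception as exc:  # a crashing checker counts as unhealthy
            err = str(exc) or exc.__class__.__name__
        if err is None:
            detail[name] = "ok"
        else:
            healthy = False
            detail[name] = err
    # Assumed mapping: 200 when all dependencies pass, 503 otherwise.
    return (200 if healthy else 503), detail


if __name__ == "__main__":
    status, detail = readiness({
        "postgres": lambda: None,
        "redis": lambda: "connection refused",
        "nats": lambda: None,
    })
    print(status, detail)
```

Under this contract a single failing dependency marks the whole service as not ready, which is the behavior the alert on /readyz relies on.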

There is also GET /api/v1/edge/health, a public aggregator that probes the edge stack (control, callengine, nodeagent, OpenSIPS, rtpengine, FreeSWITCH) via the EDGE_HEALTH_* URLs.
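The edge aggregator can be pictured as a loop over the configured probe URLs. The sketch below is hypothetical: it collects every variable that starts with `EDGE_HEALTH_` without assuming specific variable names, uses whatever follows the prefix as the component label, and treats any 2xx response as healthy. The function names and the timeout are assumptions.

```python
import http.client
import os
import urllib.request
from typing import Callable, Dict, Mapping

PREFIX = "EDGE_HEALTH_"


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection errors, HTTP errors, timeouts and malformed URLs.
        return False


def edge_health(
    env: Mapping[str, str] = os.environ,
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, bool]:
    """Probe every non-empty EDGE_HEALTH_* URL; key results by the lowercased suffix."""
    results: Dict[str, bool] = {}
    for key, url in sorted(env.items()):
        if key.startswith(PREFIX) and url:
            results[key[len(PREFIX):].lower()] = probe(url)
    return results


if __name__ == "__main__":
    # Prints one entry per configured edge component, e.g. {} when none are set.
    print(edge_health())
```

Passing `env` and `probe` as parameters keeps the function testable without real network access.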

Container health checks probe these endpoints, for example:

healthcheck:
  test: ["CMD", "wget", "--spider", "--quiet", "http://localhost:8080/healthz"]
  interval: 10s
  timeout: 5s
  retries: 5

Configure logging with LOG_LEVEL (default info) and LOG_FORMAT (default json).
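For a helper script or sidecar that should follow the same convention, the two settings could be read roughly as follows. This is a sketch under assumptions: the accepted level names, the non-JSON fallback format, the JSON field names and the `get2dial` logger name are illustrative, not taken from the services themselves.

```python
import json
import logging
import os
from typing import Mapping

# Assumed set of accepted LOG_LEVEL values; unknown values fall back to info.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warn": logging.WARNING,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (field names are illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname.lower(),
            "logger": record.name,
            "msg": record.getMessage(),
        })


def configure_logging(env: Mapping[str, str] = os.environ) -> logging.Logger:
    """Build a logger from LOG_LEVEL (default info) and LOG_FORMAT (default json)."""
    level = LEVELS.get(env.get("LOG_LEVEL", "info").lower(), logging.INFO)
    fmt = env.get("LOG_FORMAT", "json").lower()
    handler = logging.StreamHandler()
    if fmt == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
    logger = logging.getLogger("get2dial")  # hypothetical logger name
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.setLevel(level)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    configure_logging().info("logging configured")
```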

Key signals to alert on:

  • /readyz failing — a dependency (Postgres/Redis/NATS) is down.
  • Dialer pacing stalls (no originates while leads remain) — often a NODE_ID mismatch between pacer target and edge subscription.
  • NATS disconnects between control plane and an edge.
  • Abandonment rate approaching a campaign’s abandonment_rate_cap.

When investigating an alert, use the trace ID: it is propagated end to end, so a single call can be followed across services. After upgrades, treat the migration version as a first-class health signal.
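The pacing-stall and abandonment signals above can be turned into simple threshold checks. The sketch below is illustrative only: the `CampaignSnapshot` fields, the 80% warning fraction, the 60-second stall window, and the definition of abandonment rate as abandoned calls divided by answered calls are all assumptions, and only `abandonment_rate_cap` comes from the campaign configuration named above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CampaignSnapshot:
    """Hypothetical point-in-time view of one campaign."""
    abandoned_calls: int
    answered_calls: int
    abandonment_rate_cap: float          # assumed to be a fraction, e.g. 0.03 for 3%
    leads_remaining: int
    seconds_since_last_originate: float


def abandonment_rate(s: CampaignSnapshot) -> float:
    """Assumed definition: abandoned / answered; 0.0 when nothing was answered yet."""
    if s.answered_calls == 0:
        return 0.0
    return s.abandoned_calls / s.answered_calls


def alerts(
    s: CampaignSnapshot,
    warn_fraction: float = 0.8,
    stall_after_s: float = 60.0,
) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    out: List[str] = []
    cap = s.abandonment_rate_cap
    # Warn before the cap is reached, not only once it has been exceeded.
    if cap > 0 and abandonment_rate(s) >= warn_fraction * cap:
        out.append("abandonment_near_cap")
    # A stall means leads remain but no originate was seen within the window.
    if s.leads_remaining > 0 and s.seconds_since_last_originate >= stall_after_s:
        out.append("pacing_stalled")
    return out


if __name__ == "__main__":
    snap = CampaignSnapshot(5, 200, 0.03, 10, 120.0)
    print(alerts(snap))
```

Warning at a fraction of the cap leaves time to slow pacing before the cap itself is reached. If a stall alert fires, the NODE_ID match between pacer target and edge subscription is the first thing to check.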