
Observability Runbook

Production observability should verify more than process liveness:

  • HTTP/gRPC/API requests are being handled.
  • PostgreSQL and Redis are available.
  • Login, 2FA, OAuth2, WebAuthn, and email codes work.
  • WebSocket and room realtime sync work.
  • Provider, proxy, slice cache, and livestream paths do not show sustained failures.
  • Cluster registration, event publish, and catch-up are healthy.

Main service readiness:

curl -fsS http://localhost:8080/health/ready

Recommendations:

  • Point readiness probes at the /health/ready endpoint; a bare TCP check only proves the port accepts connections.
  • Keep liveness probes conservative to avoid restart loops during transient dependency issues.
  • During rolling updates, readiness should fail quickly after shutdown begins so traffic stops entering old pods.
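
The recommendations above can be sketched as a Kubernetes probe spec. The /health/ready path comes from this runbook; the port, the TCP liveness check, and all timing values are assumptions to tune for your deployment:

```yaml
# Sketch only: timings and port are assumptions, not shipped defaults.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2    # fail fast so old pods stop receiving traffic
livenessProbe:
  tcpSocket:             # deliberately weaker than readiness: process-level only
    port: 8080
  periodSeconds: 10
  failureThreshold: 6    # conservative: ride out transient dependency issues
```

Keeping failureThreshold low on readiness and high on liveness implements the fail-fast/restart-conservatively split described above.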

Enable metrics:

metrics:
  enabled: true
  host: "0.0.0.0"
  port: 9090
  auth:
    mode: "bearer_token"
    bearer_token_file: "/run/secrets/metrics_token"

Scrape test:

curl -fsS \
  -H "Authorization: Bearer $METRICS_TOKEN" \
  http://localhost:9090/metrics
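
Beyond checking that the endpoint answers, it can help to parse a captured scrape and compute a quick error ratio. A minimal sketch in Python; the http_requests_total metric and status label here are assumptions, not SyncTV's actual series names, so substitute what your build exports:

```python
# Parse Prometheus text exposition format into {series: value}.
# The metric/label names in the sample are assumptions for illustration.
def parse_counters(text: str) -> dict[str, float]:
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        counters[name] = float(value)
    return counters

sample = """\
# TYPE http_requests_total counter
http_requests_total{status="200"} 970
http_requests_total{status="500"} 30
"""
c = parse_counters(sample)
total = sum(c.values())
errors = sum(v for k, v in c.items() if '"5' in k)  # 5xx series only
print(f"5xx ratio: {errors / total:.2%}")  # → 3.00%
```

The same parsing works on a file saved from the curl scrape above, which keeps the triage step independent of a full Prometheus stack.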

Production requirements:

  • Do not expose metrics publicly.
  • In Kubernetes, prefer ServiceMonitor, VMServiceScrape, or controlled scraping.
  • If using Kubernetes auth, make sure the scraper service account has the required permissions.
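
For Prometheus Operator setups, a ServiceMonitor along these lines is one controlled-scraping option. All names, labels, and the secret reference are placeholders; the token in the secret must match the file configured under metrics.auth above:

```yaml
# Sketch only: metadata, labels, and secret names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: synctv
  namespace: synctv
spec:
  selector:
    matchLabels:
      app: synctv            # must match the metrics Service's labels
  endpoints:
    - port: metrics          # named Service port exposing 9090
      path: /metrics
      bearerTokenSecret:     # token matching metrics.auth.bearer_token_file
        name: synctv-metrics-token
        key: token
```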

Use JSON logs in production:

logging:
  level: "info"
  format: "json"

For troubleshooting:

logging:
  level: "debug"
  filter: "synctv=debug,tower_http=info"

Notes:

  • trace-level logging can produce very large logs and should not stay enabled in production.
  • File log paths follow configuration path-resolution rules; avoid read-only directories.
  • Config display redacts secrets, but do not paste raw environment variables or values files containing real secrets into public issues.

Dependencies

PostgreSQL connectivity, pool exhaustion, Redis connectivity, Redis latency, and Redis key-prefix collisions.
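
The first question for most dependency alerts is plain reachability. A minimal sketch; the hosts and ports below are placeholders for your actual PostgreSQL and Redis endpoints:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoints; substitute your real PostgreSQL/Redis hosts.
for name, host, port in [("postgres", "127.0.0.1", 5432),
                         ("redis", "127.0.0.1", 6379)]:
    print(name, "up" if tcp_reachable(host, port) else "DOWN")
```

Reachability passing while the service still errors points at the later items in the list: pool exhaustion, latency, or key-prefix collisions rather than connectivity.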

Authentication

Login failures, MFA failures, email send failures, OAuth2 state errors, and brute-force lockouts.
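
Brute-force patterns stand out once failed logins are bucketed per client in a sliding window. A sketch under assumptions: the event tuples, window, and threshold are hypothetical, not SyncTV's actual log schema or lockout policy:

```python
from collections import defaultdict, deque

WINDOW = 300     # seconds; assumption, not the service's real policy
THRESHOLD = 10   # failures per IP within the window

def brute_force_ips(events, window=WINDOW, threshold=THRESHOLD):
    """events: iterable of (unix_ts, ip) failed-login pairs in time order."""
    recent = defaultdict(deque)
    flagged = set()
    for ts, ip in events:
        q = recent[ip]
        q.append(ts)
        while q and ts - q[0] > window:
            q.popleft()                 # drop failures outside the window
        if len(q) >= threshold:
            flagged.add(ip)
    return flagged

# Hypothetical failed-login events: one noisy IP, one benign IP.
events = [(i, "203.0.113.7") for i in range(12)] + [(100, "198.51.100.2")]
print(brute_force_ips(events))  # → {'203.0.113.7'}
```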

Realtime

WebSocket connections, per-user/per-room limits, message rate limiting, and reconnect spikes.
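
Reconnect spikes are largely a client-behavior problem: if every client retries on the same schedule after a restart, they arrive as a thundering herd. Exponential backoff with full jitter spreads them out; a sketch with assumed base and cap values:

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0):
    """Yield exponential-backoff sleep times with full jitter.

    Full jitter (uniform over [0, min(cap, base * 2^n)]) decorrelates
    clients so reconnects do not arrive in synchronized waves.
    """
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

random.seed(0)  # deterministic for the demo only
for i, d in enumerate(backoff_delays(6)):
    print(f"attempt {i}: sleep {d:.2f}s")
```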

Media

Provider error rate, upstream timeouts, proxy bypass, slice-cache hit rate, Range anomalies, and livestream retries.
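
Range anomalies are easier to triage once the header is normalized. A sketch parsing a single-range bytes header per RFC 7233; multi-range requests and non-bytes units are out of scope, and the function is illustrative, not the service's actual parser:

```python
import re

def parse_byte_range(header: str, size: int):
    """Parse a single 'bytes=start-end' Range header into (start, end).

    Returns None for unsatisfiable or unsupported (multi-range) values.
    """
    m = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if not m or (not m.group(1) and not m.group(2)):
        return None
    start_s, end_s = m.groups()
    if not start_s:                       # suffix form: bytes=-N (last N bytes)
        length = int(end_s)
        return (max(0, size - length), size - 1) if length else None
    start = int(start_s)
    end = int(end_s) if end_s else size - 1
    if start >= size or end < start:
        return None                       # unsatisfiable range
    return (start, min(end, size - 1))

print(parse_byte_range("bytes=0-1023", 4096))  # → (0, 1023)
print(parse_byte_range("bytes=-500", 4096))    # → (3596, 4095)
print(parse_byte_range("bytes=5000-", 4096))   # → None
```

Logging the normalized (start, end) alongside slice-cache hits makes it easier to see whether "Range anomalies" are malformed headers or legitimate requests missing the cache.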

Alert triage (alert, possible cause, first checks):

  • Readiness keeps failing: DB/Redis unavailable, migration issue, or config error. Check synctv db status and service logs.
  • Login failures spike: brute force, OAuth2 callback error, or wrong JWT rotation. Check rate-limit logs, OAuth2 config, and recent secret changes.
  • Email sending fails: SMTP credentials, TLS, or provider rate limit. Check synctv settings test-email and SMTP logs.
  • WebSocket disconnect spike: ingress timeout, rolling update, or low connection limits. Check ingress timeouts, shutdown drain, and connection limits.
  • Provider errors spike: upstream unavailable, header mismatch, or expired credential. Check provider config, proxy headers, and the upstream response.
  • Redis errors: Redis restart, network issue, or Sentinel misconfiguration. Check Redis logs, redis.deployment_mode, and the connection URL.

Collect shareable information first. Do not paste secrets.

  1. Record version, deployment method, and whether cluster mode is enabled.
  2. Save the redacted output of synctv config show --output yaml.
  3. Save recent service logs and restart events.
  4. Save /health/ready and /metrics availability results.
  5. Save synctv db status output.
  6. For Kubernetes, save kubectl describe pod, kubectl get ingress,svc,pod, and rollout status.
  7. For media issues, record provider type, direct/proxy mode, whether Range was used, and upstream HTTP status.
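
Step 2 relies on the service's built-in redaction; for other files you may need to share (e.g. Helm values), a sketch that masks secret-bearing keys before pasting. The key-substring list is an assumption to extend for your own naming:

```python
# Key substrings treated as secret-bearing: an assumption, extend as needed.
SECRET_KEYS = {"password", "secret", "token", "key", "dsn", "url"}

def redact(node):
    """Recursively mask values whose key looks like it carries a secret."""
    if isinstance(node, dict):
        return {
            k: "***REDACTED***" if any(s in k.lower() for s in SECRET_KEYS)
               else redact(v)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [redact(v) for v in node]
    return node

config = {"database": {"url": "postgres://user:pass@db/synctv"},
          "metrics": {"port": 9090},
          "jwt": {"secret": "abc"}}
print(redact(config))  # database.url and jwt.secret are masked
```

Substring matching is deliberately aggressive: over-redacting a shared config is cheap, while leaking one credential is not.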

Kubernetes triage commands:

kubectl -n synctv get pod -o wide
kubectl -n synctv describe pod <pod>
kubectl -n synctv logs <pod> --tail=200