
Observability Runbook

Production observability should verify more than process liveness:

  • HTTP/gRPC/API requests are being handled.
  • PostgreSQL and Redis are available.
  • Login, 2FA, OAuth2, WebAuthn, and email codes work.
  • WebSocket and room realtime sync work.
  • Provider, proxy, slice cache, and livestream paths do not show sustained failures.
  • Cluster registration, event publish, and catch-up are healthy.

Main service readiness:

curl -fsS http://localhost:8080/health/ready

Recommendations:

  • Point readiness probes at the /health/ready endpoint; a bare TCP check only proves the port accepts connections.
  • Keep liveness probes conservative to avoid restart loops during transient dependency issues.
  • During rolling updates, readiness should fail quickly after shutdown begins so traffic stops entering old pods.
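
The recommendations above can be sketched as a Kubernetes probe spec. The /health/ready path comes from this runbook; the port, the TCP liveness check, and all timing values are assumptions to tune for your deployment:

```yaml
# Sketch only: timings and port are assumptions, not shipped defaults.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2    # fail fast so old pods stop receiving traffic
livenessProbe:
  tcpSocket:             # deliberately weaker than readiness: process-level only
    port: 8080
  periodSeconds: 10
  failureThreshold: 6    # conservative: ride out transient dependency issues
```

Keeping failureThreshold low on readiness and high on liveness implements the fail-fast/restart-conservatively split described above.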

Enable metrics:

metrics:
  enabled: true
  host: "0.0.0.0"
  port: 9090
  auth:
    mode: "bearer_token"
    bearer_token_file: "/run/secrets/metrics_token"

Scrape test:

curl -fsS \
  -H "Authorization: Bearer $METRICS_TOKEN" \
  http://localhost:9090/metrics
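
Beyond checking that the endpoint answers, it can help to parse a captured scrape and compute a quick error ratio. A minimal sketch in Python; the http_requests_total metric and status label here are assumptions, not SyncTV's actual series names, so substitute what your build exports:

```python
# Parse Prometheus text exposition format into {series: value}.
# The metric/label names in the sample are assumptions for illustration.
def parse_counters(text: str) -> dict[str, float]:
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        counters[name] = float(value)
    return counters

sample = """\
# TYPE http_requests_total counter
http_requests_total{status="200"} 970
http_requests_total{status="500"} 30
"""
c = parse_counters(sample)
total = sum(c.values())
errors = sum(v for k, v in c.items() if '"5' in k)  # 5xx series only
print(f"5xx ratio: {errors / total:.2%}")  # → 3.00%
```

The same parsing works on a file saved from the curl scrape above, which keeps the triage step independent of a full Prometheus stack.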

Production requirements:

  • Do not expose metrics publicly.
  • In Kubernetes, prefer ServiceMonitor, VMServiceScrape, or controlled scraping.
  • If using Kubernetes auth, make sure the scraper service account has the required permissions.
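
For Prometheus Operator setups, a ServiceMonitor along these lines is one controlled-scraping option. All names, labels, and the secret reference are placeholders; the token in the secret must match the file configured under metrics.auth above:

```yaml
# Sketch only: metadata, labels, and secret names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: synctv
  namespace: synctv
spec:
  selector:
    matchLabels:
      app: synctv            # must match the metrics Service's labels
  endpoints:
    - port: metrics          # named Service port exposing 9090
      path: /metrics
      bearerTokenSecret:     # token matching metrics.auth.bearer_token_file
        name: synctv-metrics-token
        key: token
```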

Use JSON logs in production:

logging:
  level: "info"
  format: "json"

For troubleshooting:

logging:
  level: "debug"
  filter: "synctv=debug,tower_http=info"

Notes:

  • trace-level logging can produce very large logs and should not stay enabled in production.
  • File log paths follow configuration path-resolution rules; avoid read-only directories.
  • Config display redacts secrets, but do not paste raw environment variables or values files containing real secrets into public issues.

Dependencies

PostgreSQL connectivity, pool exhaustion, Redis connectivity, Redis latency, and Redis key-prefix collisions.
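
The first question for most dependency alerts is plain reachability. A minimal sketch; the hosts and ports below are placeholders for your actual PostgreSQL and Redis endpoints:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoints; substitute your real PostgreSQL/Redis hosts.
for name, host, port in [("postgres", "127.0.0.1", 5432),
                         ("redis", "127.0.0.1", 6379)]:
    print(name, "up" if tcp_reachable(host, port) else "DOWN")
```

Reachability passing while the service still errors points at the later items in the list: pool exhaustion, latency, or key-prefix collisions rather than connectivity.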

Authentication

Login failures, MFA failures, email send failures, OAuth2 state errors, and brute-force lockouts.
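
Brute-force patterns stand out once failed logins are bucketed per client in a sliding window. A sketch under assumptions: the event tuples, window, and threshold are hypothetical, not SyncTV's actual log schema or lockout policy:

```python
from collections import defaultdict, deque

WINDOW = 300     # seconds; assumption, not the service's real policy
THRESHOLD = 10   # failures per IP within the window

def brute_force_ips(events, window=WINDOW, threshold=THRESHOLD):
    """events: iterable of (unix_ts, ip) failed-login pairs in time order."""
    recent = defaultdict(deque)
    flagged = set()
    for ts, ip in events:
        q = recent[ip]
        q.append(ts)
        while q and ts - q[0] > window:
            q.popleft()                 # drop failures outside the window
        if len(q) >= threshold:
            flagged.add(ip)
    return flagged

# Hypothetical failed-login events: one noisy IP, one benign IP.
events = [(i, "203.0.113.7") for i in range(12)] + [(100, "198.51.100.2")]
print(brute_force_ips(events))  # → {'203.0.113.7'}
```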

Realtime

WebSocket connections, per-user/per-room limits, message rate limiting, and reconnect spikes.
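
Reconnect spikes are largely a client-behavior problem: if every client retries on the same schedule after a restart, they arrive as a thundering herd. Exponential backoff with full jitter spreads them out; a sketch with assumed base and cap values:

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0):
    """Yield exponential-backoff sleep times with full jitter.

    Full jitter (uniform over [0, min(cap, base * 2^n)]) decorrelates
    clients so reconnects do not arrive in synchronized waves.
    """
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

random.seed(0)  # deterministic for the demo only
for i, d in enumerate(backoff_delays(6)):
    print(f"attempt {i}: sleep {d:.2f}s")
```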

Media

Provider error rate, upstream timeouts, proxy bypass, slice-cache hit rate, Range anomalies, and livestream retries.
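
Range anomalies are easier to triage once the header is normalized. A sketch parsing a single-range bytes header per RFC 7233; multi-range requests and non-bytes units are out of scope, and the function is illustrative, not the service's actual parser:

```python
import re

def parse_byte_range(header: str, size: int):
    """Parse a single 'bytes=start-end' Range header into (start, end).

    Returns None for unsatisfiable or unsupported (multi-range) values.
    """
    m = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if not m or (not m.group(1) and not m.group(2)):
        return None
    start_s, end_s = m.groups()
    if not start_s:                       # suffix form: bytes=-N (last N bytes)
        length = int(end_s)
        return (max(0, size - length), size - 1) if length else None
    start = int(start_s)
    end = int(end_s) if end_s else size - 1
    if start >= size or end < start:
        return None                       # unsatisfiable range
    return (start, min(end, size - 1))

print(parse_byte_range("bytes=0-1023", 4096))  # → (0, 1023)
print(parse_byte_range("bytes=-500", 4096))    # → (3596, 4095)
print(parse_byte_range("bytes=5000-", 4096))   # → None
```

Logging the normalized (start, end) alongside slice-cache hits makes it easier to see whether "Range anomalies" are malformed headers or legitimate requests missing the cache.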

Alert triage (alert, possible cause, first checks):

  • Readiness keeps failing: DB/Redis unavailable, migration issue, or config error. Check synctv db status and service logs.
  • Login failures spike: brute force, OAuth2 callback error, or wrong JWT rotation. Check rate-limit logs, OAuth2 config, and recent secret changes.
  • Email sending fails: SMTP credentials, TLS, or provider rate limit. Check synctv settings test-email and SMTP logs.
  • WebSocket disconnect spike: ingress timeout, rolling update, or low connection limits. Check ingress timeouts, shutdown drain, and connection limits.
  • Provider errors spike: upstream unavailable, header mismatch, or expired credential. Check provider config, proxy headers, and the upstream response.
  • Redis errors: Redis restart, network issue, or Sentinel misconfiguration. Check Redis logs, redis.deployment_mode, and the connection URL.

Collect shareable information first. Do not paste secrets.

  1. Record version, deployment method, and whether cluster mode is enabled.
  2. Save the redacted output of synctv config show --output yaml.
  3. Save recent service logs and restart events.
  4. Save /health/ready and /metrics availability results.
  5. Save synctv db status output.
  6. For Kubernetes, save kubectl describe pod, kubectl get ingress,svc,pod, and rollout status.
  7. For media issues, record provider type, direct/proxy mode, whether Range was used, and upstream HTTP status.
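
Step 2 relies on the service's built-in redaction; for other files you may need to share (e.g. Helm values), a sketch that masks secret-bearing keys before pasting. The key-substring list is an assumption to extend for your own naming:

```python
# Key substrings treated as secret-bearing: an assumption, extend as needed.
SECRET_KEYS = {"password", "secret", "token", "key", "dsn", "url"}

def redact(node):
    """Recursively mask values whose key looks like it carries a secret."""
    if isinstance(node, dict):
        return {
            k: "***REDACTED***" if any(s in k.lower() for s in SECRET_KEYS)
               else redact(v)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [redact(v) for v in node]
    return node

config = {"database": {"url": "postgres://user:pass@db/synctv"},
          "metrics": {"port": 9090},
          "jwt": {"secret": "abc"}}
print(redact(config))  # database.url and jwt.secret are masked
```

Substring matching is deliberately aggressive: over-redacting a shared config is cheap, while leaking one credential is not.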

Kubernetes triage commands:

kubectl -n synctv get pod -o wide
kubectl -n synctv describe pod <pod>
kubectl -n synctv logs <pod> --tail=200