Observability & Operations

Every answer carries an audit trail covering its context, retrieval results, and output.
Latency and safety metrics each get a first-class dashboard.
Incident playbooks define the rollback and safe-mode steps.

Request Tracing

  • Request ID tied to user, job, tenant, and model call.
  • Logs include retrieval IDs, citation count, and output hash.
  • Trace spans cover context assembly, retrieval, and inference.

Decision
Auditability is required for trust and post-incident review.
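
The tracing fields above can be sketched as one audit record per request. This is a minimal illustration using only the Python standard library, not the system's actual logging code: the class name RequestTrace, its field names, and the choice of SHA-256 for the output hash are all assumptions made for the example.

```python
import hashlib
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class RequestTrace:
    """One audit record per answer. All names here are hypothetical."""
    user_id: str
    job_id: str
    tenant_id: str
    model_call_id: str
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name):
        """Time one stage and record it even if the stage raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.spans.append({"name": name, "duration_ms": round(elapsed_ms, 3)})

    def log_line(self, retrieval_ids, citation_count, output_text):
        """Serialize the audit fields; the output is stored as a hash, not raw text."""
        record = {
            "request_id": self.request_id,
            "user_id": self.user_id,
            "job_id": self.job_id,
            "tenant_id": self.tenant_id,
            "model_call_id": self.model_call_id,
            "retrieval_ids": list(retrieval_ids),
            "citation_count": citation_count,
            "output_hash": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
            "spans": self.spans,
        }
        return json.dumps(record, sort_keys=True)


def traced_answer(trace, retrieve, infer, question):
    """Wrap the three stages named above; retrieve and infer are caller-supplied stubs."""
    with trace.span("context_assembly"):
        context = {"question": question}
    with trace.span("retrieval"):
        docs = retrieve(context)  # assumed to return a list of (doc_id, text) pairs
    with trace.span("inference"):
        answer = infer(context, docs)
    doc_ids = [doc_id for doc_id, _ in docs]
    return answer, trace.log_line(doc_ids, len(doc_ids), answer)
```

Hashing the output keeps the log compact and avoids copying user-facing text into the logging pipeline, while still letting a reviewer confirm that a stored answer matches what was served.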

Metrics That Matter

  • P95 response latency
  • Citation coverage rate
  • Abstention rate (by risk category)
  • Eval pass rate by release
  • Cost per session (via AI Gateway)

Tradeoff
We track fewer metrics but make each one actionable and tied to a product decision.
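
As a rough sketch of how these five metrics could be computed from per-request records, the following uses plain Python. The record keys (latency_ms, citation_count, abstained, risk_category, session_id, cost_usd) are hypothetical, and the definitions are assumptions: citation coverage is taken here as the share of non-abstained answers with at least one citation, and P95 uses the nearest-rank method. Cost is assumed to arrive already attributed per request, for example from the gateway's usage data.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile; raises on empty input."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def eval_pass_rate_by_release(results):
    """results: iterable of (release, passed) pairs from an eval run."""
    totals = defaultdict(lambda: [0, 0])  # release -> [passed, total]
    for release, passed in results:
        totals[release][0] += 1 if passed else 0
        totals[release][1] += 1
    return {release: ok / n for release, (ok, n) in totals.items()}


def summarize(records):
    """Compute the request-level metrics from a non-empty list of dict records."""
    if not records:
        raise ValueError("summarize needs at least one record")

    answered = [r for r in records if not r["abstained"]]
    cited = sum(1 for r in answered if r["citation_count"] > 0)

    by_risk = defaultdict(lambda: [0, 0])  # risk category -> [abstained, total]
    cost_by_session = defaultdict(float)
    for r in records:
        by_risk[r["risk_category"]][1] += 1
        if r["abstained"]:
            by_risk[r["risk_category"]][0] += 1
        cost_by_session[r["session_id"]] += r["cost_usd"]

    return {
        "p95_latency_ms": p95([r["latency_ms"] for r in records]),
        "citation_coverage": cited / len(answered) if answered else 0.0,
        "abstention_rate_by_risk": {k: a / t for k, (a, t) in by_risk.items()},
        "cost_per_session_usd": sum(cost_by_session.values()) / len(cost_by_session),
    }
```

Breaking abstention out by risk category matters because a single blended rate can hide a system that abstains too often on low-risk questions and too rarely on high-risk ones.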

Incident Playbook (Condensed)

  1. Detect anomaly (latency spike, unsafe output, or retrieval failure).
  2. Flip to safe mode (abstain when no supporting evidence is retrieved; disable risky tools).
  3. Roll back prompt or retrieval changes.
  4. Run a postmortem using the evidence logs, and record remediation steps.

Risk
Without safe mode, the only recovery path is a full outage.
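
The playbook's first three steps can be sketched as a small runtime config with a safe-mode flag and a rollback history. Everything named here is an assumption for illustration: the tool names in RISKY_TOOLS, the 2000 ms latency budget, the abstention message, and the idea that prompt and retrieval versions are deployed as a pair.

```python
from dataclasses import dataclass, field

RISKY_TOOLS = frozenset({"code_exec", "send_email"})  # hypothetical tool names
ABSTAIN_MESSAGE = "I can't answer that without supporting evidence."


@dataclass
class RuntimeConfig:
    prompt_version: str
    retrieval_version: str
    safe_mode: bool = False
    history: list = field(default_factory=list)

    def deploy(self, prompt_version, retrieval_version):
        """Record the current versions so they can be restored later."""
        self.history.append((self.prompt_version, self.retrieval_version))
        self.prompt_version = prompt_version
        self.retrieval_version = retrieval_version

    def rollback(self):
        """Step 3: restore the previous prompt and retrieval versions."""
        if not self.history:
            raise RuntimeError("nothing to roll back to")
        self.prompt_version, self.retrieval_version = self.history.pop()


def detect_anomaly(p95_latency_ms, unsafe_outputs, retrieval_failures,
                   latency_budget_ms=2000):
    """Step 1: any one of the three signals is enough to trigger the playbook."""
    return (p95_latency_ms > latency_budget_ms
            or unsafe_outputs > 0
            or retrieval_failures > 0)


def allowed_tools(config, requested):
    """Step 2a: in safe mode, drop tools on the risky list."""
    if config.safe_mode:
        return [t for t in requested if t not in RISKY_TOOLS]
    return list(requested)


def respond(config, draft_answer, evidence_ids):
    """Step 2b: in safe mode, abstain when no evidence backs the draft."""
    if config.safe_mode and not evidence_ids:
        return ABSTAIN_MESSAGE
    return draft_answer
```

A typical sequence under these assumptions would be: detect_anomaly returns True, an operator sets config.safe_mode = True so the service keeps answering in a degraded form, and config.rollback() then restores the last known-good versions before safe mode is lifted.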