Observability & Operations
Every answer has an audit trail: context, retrieval, and output.
Latency and safety metrics each get a first-class dashboard.
Incident playbooks define rollback and safe mode steps.
Request Tracing
- Request ID tied to user, job, tenant, and model call.
- Logs include retrieval IDs, citation count, and output hash.
- Trace spans cover context assembly, retrieval, and inference.
Decision
Auditability is required for trust and post-incident review.
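As a rough illustration of the tracing bullets above, the sketch below builds one audit record per request and times each stage. It is a minimal example, not the actual implementation: the `AuditRecord` fields, the `span` helper, and the sample IDs are all hypothetical, and SHA-256 is an assumed choice for the output hash.

```python
from __future__ import annotations

import hashlib
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class AuditRecord:
    # Field names are hypothetical; map them onto the real log schema.
    user_id: str
    job_id: str
    tenant_id: str
    model_call_id: str
    retrieval_ids: list[str]
    citation_count: int
    output_hash: str
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    span_seconds: dict[str, float] = field(default_factory=dict)


def hash_output(text: str) -> str:
    """Hash the model output so logs can prove what was returned
    without storing the full text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@contextmanager
def span(record: AuditRecord, name: str):
    """Record the wall-clock duration of one stage on the audit record."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record.span_seconds[name] = time.perf_counter() - start


def to_log_line(record: AuditRecord) -> str:
    """Serialize the record as a single JSON log line."""
    return json.dumps(asdict(record), sort_keys=True)


# Example usage with made-up identifiers.
record = AuditRecord(
    user_id="u-1",
    job_id="job-7",
    tenant_id="tenant-a",
    model_call_id="call-42",
    retrieval_ids=["doc-1", "doc-2"],
    citation_count=2,
    output_hash=hash_output("example answer"),
)
for stage in ("context_assembly", "retrieval", "inference"):
    with span(record, stage):
        pass  # the real stage would run here
print(to_log_line(record))
```

Keeping the hash rather than the raw output is one way to make the trail verifiable during a post-incident review while limiting what the logs retain.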
Metrics That Matter
- P95 response latency
- Citation coverage rate
- Abstention rate (by risk category)
- Eval pass rate by release
- Cost per session (via AI Gateway)
Tradeoff
We track fewer metrics, but make them actionable and tied to product decisions.
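Three of the metrics listed above can be computed from per-request events, as in the sketch below. The event shape (`risk`, `abstained`, `citations`) is hypothetical, and the definitions are assumptions made for illustration: P95 uses the nearest-rank method, and citation coverage is taken to mean the share of non-abstained answers that carry at least one citation.

```python
import math
from collections import defaultdict


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def citation_coverage(events):
    """Share of answered (non-abstained) requests with >= 1 citation.
    This definition is an assumption; the source does not define it."""
    answered = [e for e in events if not e["abstained"]]
    if not answered:
        return 0.0
    cited = sum(1 for e in answered if e["citations"] > 0)
    return cited / len(answered)


def abstention_rate_by_risk(events):
    """Abstention rate per risk category."""
    totals = defaultdict(int)
    abstained = defaultdict(int)
    for e in events:
        totals[e["risk"]] += 1
        if e["abstained"]:
            abstained[e["risk"]] += 1
    return {risk: abstained[risk] / totals[risk] for risk in totals}


# Example usage with made-up events.
events = [
    {"risk": "high", "abstained": True, "citations": 0},
    {"risk": "high", "abstained": False, "citations": 2},
    {"risk": "low", "abstained": False, "citations": 0},
    {"risk": "low", "abstained": False, "citations": 1},
]
print(p95([120, 95, 300, 180, 210]))
print(citation_coverage(events))
print(abstention_rate_by_risk(events))
```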
Incident Playbook (Condensed)
- Detect anomaly (latency spike, unsafe output, or retrieval failure).
- Flip to safe mode (abstain when evidence is missing; disable risky tools).
- Roll back prompt or retrieval changes.
- Postmortem with evidence logs and remediation steps.
Risk
Without safe mode, the only way to contain an incident is a full outage.
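The detect and safe-mode steps of the playbook might look like the sketch below. Everything named here is an assumption for illustration: the thresholds, the `RISKY_TOOLS` set, and the `Policy` flag are hypothetical, and a real system would likely read the flag from a feature-flag service rather than construct it inline.

```python
from dataclasses import dataclass

# Hypothetical tool names; the real risky-tool list is not given in the source.
RISKY_TOOLS = frozenset({"code_exec", "external_write"})

# Hypothetical anomaly threshold.
P95_LATENCY_LIMIT_MS = 4000


@dataclass(frozen=True)
class Policy:
    safe_mode: bool = False


def detect_anomaly(p95_ms, unsafe_outputs, retrieval_failures):
    """True if any of the three playbook triggers fires."""
    return (
        p95_ms > P95_LATENCY_LIMIT_MS
        or unsafe_outputs > 0
        or retrieval_failures > 0
    )


def allowed_tools(policy, requested):
    """In safe mode, drop risky tools; otherwise allow everything requested."""
    if policy.safe_mode:
        return [t for t in requested if t not in RISKY_TOOLS]
    return list(requested)


def should_abstain(policy, retrieval_ids):
    """In safe mode, abstain whenever retrieval returned no evidence."""
    return policy.safe_mode and not retrieval_ids


# Example usage: an unsafe output flips the service into safe mode.
policy = Policy(safe_mode=detect_anomaly(1200, unsafe_outputs=1, retrieval_failures=0))
print(allowed_tools(policy, ["search", "code_exec"]))
print(should_abstain(policy, retrieval_ids=[]))
```

Because safe mode only narrows behavior, it can be flipped before the root cause is known, and the prompt or retrieval rollback can follow once the evidence logs point to a culprit.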