Nathan Schrader

Operations

Observability & Operations

Every answer has an audit trail: context, retrieval, and output.

Latency and safety metrics get first-class dashboards.

Incident playbooks define rollback and safe-mode steps.

Request Tracing

Each request ID is tied to the user, job, tenant, and model call.
Logs include retrieval IDs, citation count, and output hash.
Trace spans cover context assembly, retrieval, and inference.

Decision

Auditability is required for trust and post-incident review.
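
A minimal sketch of what one audit record could look like, assuming a JSON log line per model call. The field names, the `AuditRecord` class, and the `span` timing helper are illustrative assumptions, not an existing schema or tracing library; a real deployment would likely emit spans through its own tracing stack.

```python
import hashlib
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class AuditRecord:
    """One audit-trail entry per model call (hypothetical field names)."""
    user_id: str
    job_id: str
    tenant_id: str
    model_call_id: str
    retrieval_ids: List[str]
    citation_count: int
    output_hash: str
    span_seconds: Dict[str, float] = field(default_factory=dict)
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def hash_output(text: str) -> str:
    # SHA-256 of the UTF-8 output: lets a reviewer match a stored answer
    # to its log line without putting the answer text in the log.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@contextmanager
def span(name: str, timings: Dict[str, float]):
    # Records wall-clock duration for one stage
    # (context assembly, retrieval, or inference).
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def to_log_line(record: AuditRecord) -> str:
    # Sorted keys keep log lines stable and easy to diff.
    return json.dumps(asdict(record), sort_keys=True)


timings: Dict[str, float] = {}
with span("retrieval", timings):
    retrieved = ["doc-1", "doc-2"]  # stand-in for a real retrieval call

record = AuditRecord(
    user_id="u1",
    job_id="j1",
    tenant_id="t1",
    model_call_id="m1",
    retrieval_ids=retrieved,
    citation_count=2,
    output_hash=hash_output("hello"),
    span_seconds=timings,
)
log_line = to_log_line(record)
```

Hashing the output rather than logging it keeps the log line small while still letting a post-incident review confirm which answer was served.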

Metrics That Matter

P95 response latency
Citation coverage rate
Abstention rate (by risk category)
Eval pass rate by release
Cost per session (via AI Gateway)

Tradeoff

We track fewer metrics, but make them actionable and tied to product decisions.
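
Three of the metrics above can be sketched as plain functions. The definitions here are assumptions: P95 uses the nearest-rank method, citation coverage is taken to mean the share of answers with at least one citation, and abstention rate is grouped by a caller-supplied risk category label. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def p95(latencies_ms: List[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def citation_coverage(citation_counts: List[int]) -> float:
    """Share of answers with at least one citation (assumed definition)."""
    if not citation_counts:
        raise ValueError("no answers")
    cited = sum(1 for count in citation_counts if count > 0)
    return cited / len(citation_counts)


def abstention_rate_by_category(
    events: List[Tuple[str, bool]]
) -> Dict[str, float]:
    """Map each risk category to the fraction of requests that abstained.

    Each event is (risk_category, abstained).
    """
    totals: Dict[str, int] = defaultdict(int)
    abstained: Dict[str, int] = defaultdict(int)
    for category, did_abstain in events:
        totals[category] += 1
        if did_abstain:
            abstained[category] += 1
    return {cat: abstained[cat] / totals[cat] for cat in totals}
```

Keeping each metric as a small pure function makes the definition reviewable, which matters when a metric is tied to a product decision.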

Incident Playbook (Condensed)

Detect anomaly (latency spike, unsafe output, or retrieval failure).
Flip to safe mode (abstain when evidence is missing; disable risky tools).
Roll back prompt or retrieval changes.
Run a postmortem with evidence logs and remediation steps.

Risk

Without safe mode, the only way to contain an incident is a full outage.
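
The safe-mode step could be sketched as a runtime flag that gates both tool use and answers. Everything named here is an assumption for illustration: the `RuntimeFlags` class, the tool names in `risky_tools`, and the abstention message are hypothetical, not an existing configuration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

ABSTAIN_MESSAGE = "I can't answer that without supporting evidence."


@dataclass(frozen=True)
class RuntimeFlags:
    # Flipped to True by the on-call engineer during an incident.
    safe_mode: bool = False
    # Hypothetical tool names treated as risky.
    risky_tools: FrozenSet[str] = frozenset({"code_exec", "web_write"})


def allowed_tools(requested: List[str], flags: RuntimeFlags) -> List[str]:
    """In safe mode, drop risky tools; otherwise pass the list through."""
    if not flags.safe_mode:
        return list(requested)
    return [tool for tool in requested if tool not in flags.risky_tools]


def finalize_answer(
    draft: str, evidence_ids: List[str], flags: RuntimeFlags
) -> str:
    """In safe mode, abstain unless retrieval returned evidence."""
    if flags.safe_mode and not evidence_ids:
        return ABSTAIN_MESSAGE
    return draft
```

With the flag off, both functions pass their inputs through unchanged, so safe mode degrades the service instead of taking it down.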

Open questions / next steps: choose the alerting thresholds and define ownership for eval regression triage.