# Agent Ops Playbook: SLOs, incidents, and safe rollouts
Operating AI agents in production requires more than clever prompts. This playbook distills the core practices for reliability at scale.
## Define and measure SLOs
- Error budget per capability (tool): latency, success rate, cost.
- Separate **model quality** SLOs from **integration** SLOs.
- Export metrics from MCP servers; build per-tool dashboards.
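A per-tool error budget can be tracked with a small counter keyed by tool name. The sketch below is a minimal illustration, assuming a success-rate SLO per tool; the class and field names (`ToolSLO`, `target_success_rate`) are hypothetical, not from any particular metrics library.

```python
from dataclasses import dataclass

@dataclass
class ToolSLO:
    """Per-tool SLO tracker: success rate measured against an error budget."""
    name: str
    target_success_rate: float  # e.g. 0.995 for a 99.5% SLO
    total: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        """Record one tool invocation outcome."""
        self.total += 1
        if not ok:
            self.failures += 1

    @property
    def success_rate(self) -> float:
        return 1.0 if self.total == 0 else 1 - self.failures / self.total

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent; negative means blown."""
        allowed_failures = (1 - self.target_success_rate) * self.total
        if allowed_failures == 0:
            return 1.0
        return 1 - self.failures / allowed_failures
```

Exporting `success_rate` and `budget_remaining` per tool gives dashboards a direct "how much room do we have left to ship risky changes" signal.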
## Guard changes behind safe rollout strategies
- Canary per tool/server; progressive % ramp with automatic rollback.
- Feature flags for prompts, tool parameters, and model routes.
- Store rollout state in config—not code—to enable instant freezes.
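A deterministic hash bucket is one common way to implement the percentage ramp above. This is a sketch under assumptions: the `ROLLOUT` dict stands in for externally stored config, and setting `frozen` (or `percent: 0`) in that config acts as the instant freeze without a code deploy.

```python
import hashlib

# Hypothetical rollout state, loaded from config storage rather than code.
ROLLOUT = {"tool": "search_v2", "percent": 10, "frozen": False}

def in_canary(request_id: str, rollout: dict) -> bool:
    """Deterministically bucket a request into the canary by hashing its id.

    The same request id always lands in the same bucket, so raising the
    percentage only ever adds traffic to the canary, never reshuffles it.
    """
    if rollout["frozen"] or rollout["percent"] <= 0:
        return False
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout["percent"]
```

Automatic rollback then reduces to a watcher that flips `frozen` to `True` in config when the canary's SLO metrics regress past a threshold.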
## Incident response for agents
- Standard incident severities tied to business impact.
- Runbooks per capability: reproduce, mitigate, backfill.
- Telemetry checklists: traces, request/response samples, redaction proofs.
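Tying severities to business impact works best when the mapping is explicit and mechanical. The rule below is a hypothetical example of such a mapping, not a standard; the thresholds and labels are assumptions to adapt to your own impact definitions.

```python
from enum import Enum

class Severity(Enum):
    SEV1 = "user-facing outage, majority of agent traffic failing"
    SEV2 = "degraded capability, partial user-facing impact"
    SEV3 = "internal-only impact, no user-visible errors"

def classify(error_rate: float, user_facing: bool) -> Severity:
    """Map observed impact to an incident severity (illustrative thresholds)."""
    if user_facing and error_rate > 0.5:
        return Severity.SEV1
    if user_facing:
        return Severity.SEV2
    return Severity.SEV3
```

Encoding the mapping in code (or config) removes debate during an incident: responders classify first, then mitigate.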
## Risk controls at the MCP boundary
- Confirmations for destructive tools; budget/time caps; dry-run modes.
- Per-tool auth scopes; ephemeral credentials with rotation.
- Policy engine that scores risk before allowing execution.
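The controls above compose naturally: a pre-execution policy check scores each tool call and routes it to allow, confirm, or deny. This is a minimal sketch; the tool names, weights, and thresholds are all hypothetical and would come from policy config in practice.

```python
# Hypothetical set of tools the policy treats as destructive.
DESTRUCTIVE = {"delete_file", "drop_table", "send_email"}

def score_risk(tool: str, cost_usd: float, dry_run: bool) -> float:
    """Score a proposed tool call in [0, 1]; higher means riskier."""
    score = 0.0
    if tool in DESTRUCTIVE:
        score += 0.6          # destructive tools carry most of the risk
    if cost_usd > 1.0:
        score += 0.3          # exceeds the per-call budget cap
    if not dry_run:
        score += 0.1          # real execution is riskier than a dry run
    return min(score, 1.0)

def decide(score: float) -> str:
    """Map a risk score to a policy decision (illustrative thresholds)."""
    if score >= 0.7:
        return "deny"
    if score >= 0.4:
        return "confirm"      # require a human confirmation step
    return "allow"
```

The "confirm" branch is where the human-confirmation requirement for destructive tools plugs in; "deny" covers calls that blow both the risk and budget caps at once.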
## Postmortems that actually improve reliability
- Classify by failure domain: model, tool, transport, data, policy.
- Convert learnings into tests, alerts, and explicit guardrails.
- Track repeated regressions with owners and deadlines.
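Tracking follow-ups by failure domain, owner, and deadline can be as simple as a validated record plus an overdue query. The structure below is a sketch assuming the five failure domains named above; `ActionItem` and `overdue` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date

# The failure domains used for postmortem classification.
DOMAINS = {"model", "tool", "transport", "data", "policy"}

@dataclass
class ActionItem:
    """A postmortem follow-up: a guardrail with an owner and a deadline."""
    domain: str
    description: str
    owner: str
    due: date

    def __post_init__(self) -> None:
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown failure domain: {self.domain}")

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Surface follow-up work that has slipped past its deadline."""
    return [item for item in items if item.due < today]
```

Reviewing the `overdue` list in a recurring reliability meeting is what turns postmortem learnings into tests, alerts, and guardrails that actually land.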
---
Reliability is a product feature. Treat agent behavior like any critical service: measure, gate, test, and roll out with intent.