# Agent Ops Playbook: runbook automation and on‑call drills
Production isn’t kind. The way you practice is the way you perform. This guide shows how to codify runbooks and drill incident scenarios so your agents stay stable under pressure.
## Codify runbooks as tools
- Convert manual runbooks into MCP tools with typed inputs and clear outputs.
- Examples: `rollback_release`, `rotate_credentials`, and `pause_capability`, each with an input schema and an explicit confirmation step.
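
As a rough illustration of the bullets above, here is a minimal sketch of `rollback_release` with typed inputs, a structured result, and a confirmation step. The field names (`service`, `target_version`, `confirm`) and the status strings are assumptions made for this example, and registering the function with an MCP server is deliberately left out, since that depends on the SDK you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackInput:
    """Typed input for the tool (field names are illustrative)."""
    service: str
    target_version: str
    confirm: bool = False  # the caller must opt in explicitly

    def __post_init__(self):
        if not self.service or not self.target_version:
            raise ValueError("service and target_version are required")


@dataclass(frozen=True)
class RollbackResult:
    """Structured output so callers never have to parse free text."""
    status: str  # "needs_confirmation" or "rolled_back"
    service: str
    target_version: str
    message: str


def rollback_release(params: RollbackInput) -> RollbackResult:
    """Refuse to act until the caller confirms; otherwise report the rollback."""
    if not params.confirm:
        return RollbackResult(
            status="needs_confirmation",
            service=params.service,
            target_version=params.target_version,
            message="Re-run with confirm=True to roll back.",
        )
    # Placeholder: a real tool would call the deploy system here.
    return RollbackResult(
        status="rolled_back",
        service=params.service,
        target_version=params.target_version,
        message=f"{params.service} rolled back to {params.target_version}.",
    )
```

Returning `needs_confirmation` as a normal result, rather than raising, lets the agent show the pending action to a human before retrying with `confirm=True`.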
## Drill on staging, measure like prod
- Simulate real incidents: model drift, quota exhaustion, API regression, malformed tool outputs.
- Track mean time to acknowledge (MTTA) and mean time to resolve (MTTR) for agent incidents; publish scorecards like any SRE team.
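
The scorecard numbers can be computed from three timestamps per incident. The sketch below assumes an `Incident` record with `opened_at`, `acknowledged_at`, and `resolved_at` fields (hypothetical names), and it measures MTTR from open to resolution, which is one common definition among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Timestamps for one drill or real incident (field names are illustrative)."""
    opened_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta(incidents: list) -> timedelta:
    """Mean time from incident open to acknowledgement."""
    if not incidents:
        raise ValueError("need at least one incident")
    seconds = [(i.acknowledged_at - i.opened_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: list) -> timedelta:
    """Mean time from incident open to resolution (assumed definition)."""
    if not incidents:
        raise ValueError("need at least one incident")
    seconds = [(i.resolved_at - i.opened_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    drills = [
        Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 2), datetime(2024, 1, 1, 10, 30)),
        Incident(datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 6), datetime(2024, 1, 1, 12, 0)),
    ]
    print("MTTA:", mtta(drills))
    print("MTTR:", mttr(drills))
```

Running the same functions over staging drills and production incidents keeps the two scorecards comparable.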
## Golden paths and guardrails
- Precompute “golden” outcomes for critical tasks and assert them in CI.
- Gate risky tools with approvals and budget caps; require justifications and attach evidence.
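
One possible shape for the gate in the second bullet is sketched below. Everything here is an assumption made for illustration: the `ToolRequest` fields, the `Gate` class, and the choice to apply the budget cap only to tools listed as risky.

```python
from dataclasses import dataclass, field
from typing import Optional


class GateDenied(Exception):
    """Raised when a risky tool call fails a guardrail check."""


@dataclass
class ToolRequest:
    tool: str
    estimated_cost: float
    justification: str
    evidence: list = field(default_factory=list)  # e.g. links to logs or tickets
    approved_by: Optional[str] = None


@dataclass
class Gate:
    risky_tools: set
    budget_cap: float
    spent: float = 0.0

    def check(self, req: ToolRequest) -> None:
        """Raise GateDenied unless the request passes every guardrail."""
        if req.tool not in self.risky_tools:
            return  # non-risky tools pass through ungated
        if not req.justification.strip():
            raise GateDenied("justification required")
        if not req.evidence:
            raise GateDenied("evidence required")
        if req.approved_by is None:
            raise GateDenied("approval required")
        if self.spent + req.estimated_cost > self.budget_cap:
            raise GateDenied("budget cap exceeded")
        # Only charge the budget once every check has passed.
        self.spent += req.estimated_cost
```

Because the budget is charged only after all checks pass, a denied request never consumes spend, which keeps the cap meaningful when an agent retries.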
## Pager hygiene for agents
- Define clear severities, name an owner for each capability, and document the escalation ladder.
- Provide one-click context bundles: the latest logs, traces, and the last 10 invocations with redaction proofs.
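
A context bundle along these lines could be assembled as in the sketch below. It treats a "redaction proof" as a per-item count of redacted values, which is only one interpretation. The secret pattern, the `InvocationLog` class, and the bundle keys are all assumptions for the example.

```python
import re
from collections import deque

# Illustrative pattern: key=value pairs whose key looks like a secret.
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)=(\S+)")


def redact(text: str):
    """Return (redacted_text, number_of_values_redacted)."""
    return SECRET_PATTERN.subn(lambda m: f"{m.group(1)}=[REDACTED]", text)


class InvocationLog:
    """Keeps only the most recent tool invocations."""

    def __init__(self, keep: int = 10):
        self._recent = deque(maxlen=keep)  # older entries drop off automatically

    def record(self, tool: str, raw_args: str) -> None:
        self._recent.append((tool, raw_args))

    def bundle(self, logs: list, trace_ids: list) -> dict:
        """Build a pager-ready bundle with secrets removed and redactions counted."""
        invocations = []
        for tool, raw_args in self._recent:
            clean, count = redact(raw_args)
            invocations.append({"tool": tool, "args": clean, "redactions": count})
        clean_logs = []
        log_redactions = 0
        for line in logs:
            clean, count = redact(line)
            clean_logs.append(clean)
            log_redactions += count
        return {
            "logs": clean_logs,
            "log_redactions": log_redactions,
            "trace_ids": list(trace_ids),
            "invocations": invocations,
        }
```

The redaction counts give the responder a quick signal that scrubbing actually ran, without exposing the values that were removed.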
## After-action learning
- Turn failures into policies, tests, and improved prompts/tools.
- Track regressions and owners; drill again until it sticks.
---
Ops is a muscle. Make it explicit, measurable, and practiced, then let your agents move fast without fear.