Agentic Runbooks: what they are, how to write them, and who's shipping them in 2026
The independent reference for the phrase the industry is still learning to spell. Vendor-neutral, real YAML examples, forensic tool comparison, and a free ROI calculator.
What does an agentic runbook actually look like?
Here is a minimal agentic runbook in YAML for a Kubernetes pod crash-loop at 3am. No hypotheticals.
```yaml
# agentic-runbook: pod-crashloop-remediation
metadata:
  id: k8s-crashloop-v2
  owner: platform-eng
  risk: medium
  approvers: [on-call-lead]
  last_verified: 2026-04-01
signal_spec:
  trigger: pagerduty_alert
  condition: "alert.title contains CrashLoopBackOff"
  cooldown_minutes: 5
tool_scope:
  - kubectl_get_pod_logs
  - kubectl_describe_pod
  - kubectl_rollout_restart
  - pagerduty_acknowledge
  - slack_post_update
action_boundary:
  auto_approve:
    - kubectl_get_pod_logs
    - kubectl_describe_pod
    - slack_post_update
  require_human:
    - kubectl_rollout_restart  # writes to prod
execution_plan:
  framework: langgraph
  model: claude-sonnet-4-5
  max_iterations: 8
```
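The action_boundary is the safety-critical section of the runbook above. A minimal sketch of how an orchestrator could enforce it before each tool call — the gate function and its behaviour are illustrative, not part of LangGraph or any specific framework:

```python
# Enforce the runbook's action_boundary before any tool call.
# Tool names mirror the YAML above; everything else is a sketch.

AUTO_APPROVE = {"kubectl_get_pod_logs", "kubectl_describe_pod", "slack_post_update"}
REQUIRE_HUMAN = {"kubectl_rollout_restart"}

def gate_tool_call(tool_name: str, human_approved: bool = False) -> bool:
    """Return True if the call may proceed, False if it must be blocked."""
    if tool_name in AUTO_APPROVE:
        return True
    if tool_name in REQUIRE_HUMAN:
        # Pause here until an approver from the runbook's approvers list signs off.
        return human_approved
    # Anything outside the declared tool_scope is denied by default.
    return False
```

The deny-by-default branch matters as much as the two allow-lists: a tool the runbook never declared should not be callable at all.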
The 60-second definition
A runbook is a set of instructions for handling an incident. An automated runbook executes those instructions on a trigger, via tools like Rundeck or Ansible. An agentic runbook goes further: the execution is handled by an AI agent that reads signals, reasons about what to do, calls real tools, and learns from the outcome. The agent is not following a fixed script. It is applying judgment.
Reasons over signals
Not just 'alert fired, run script'. The agent observes CPU spikes, log patterns, dependency health, and recent deploys before choosing an action.
Chooses from a tool scope
The agent has a defined inventory of actions it can take. It selects the right tool for the situation, not the next step in a fixed list.
Learns from outcomes
After resolution, the agent updates its runbook library via a learning loop. The next similar incident takes less time.
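The first two properties — reasoning over signals and choosing from a tool scope — can be sketched as a decide-before-acting step. The signal fields and thresholds below are illustrative assumptions, not taken from any vendor:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    cpu_pct: float
    crashloop: bool
    deploy_minutes_ago: int  # minutes since the most recent deploy

def choose_action(s: Signals) -> str:
    """Pick a tool from the scope based on observed signals, not a fixed script."""
    if s.crashloop and s.deploy_minutes_ago < 30:
        return "propose_rollback"      # crash right after a deploy: suspect the deploy
    if s.crashloop:
        return "kubectl_get_pod_logs"  # crash with no recent deploy: investigate first
    if s.cpu_pct > 90:
        return "propose_scale_up"
    return "no_action"
```

A scripted runbook would run the same step every time the alert fired; here, the same alert leads to different actions depending on the wider state of the system.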
Traditional vs AI-assisted vs Agentic
The terminology is muddled. Here is the clean taxonomy. Full comparison at /traditional-vs-agentic.
| Dimension | Traditional | Automated | Agentic |
|---|---|---|---|
| Format | Confluence doc / PDF | Script / Ansible playbook | YAML + LangGraph / AutoGen |
| Trigger | Human reads alert | Webhook / cron | Observability signal + LLM reasoning |
| Execution | Human follows steps | Deterministic script | Agent chooses actions from scope |
| Adaptability | None | Low (pre-scripted paths) | High (reasons about novel situations) |
| Learning | Postmortem updates doc | None | Outcome feeds learning loop |
| Audit trail | Slack thread + notes | Script log | Full reasoning trace + tool calls |
| Typical tool | Confluence, Notion | Rundeck, Ansible | PagerDuty AIOps, Rootly, Kubiya |
Who's shipping this in 2026?
Full matrix with pricing:

| Vendor | Positioning | Pricing |
|---|---|---|
| PagerDuty | Runbook Automation + AIOps | $125/user/mo |
| incident.io | AI workflows, Slack-native | Custom |
| FireHydrant | AI-assisted runbooks | Custom |
| Rootly | AI postmortem + RCA | Custom |
| Shoreline | Notebooks, 75% MTTR claim | Custom |
| Kubiya | Meta-agent orchestration | Custom |
| Komodor Klaudia | Kubernetes-focused, 95% accuracy | Custom |
| AWS DevOps Agent | Bedrock AgentCore + MCP | Usage-based |
Pricing from vendor public pages, April 2026. Verify before procurement. See all 12 vendors including Traversal, Resolve.ai, Datadog Bits AI, xMatters, and OpenSRE.
What agentic runbooks are actually doing in production
Highlights from the full list of 12 use cases:

- **Pod crash-loop remediation** — agent detects CrashLoopBackOff, reads logs, proposes restart, gets approval. 23-second MTTR.
- **Deployment rollback** — error-rate spike triggers the agent to diff recent deploys and propose rollback to the last stable version.
- **Certificate expiry rotation** — a proactive agent runs nightly, detects certs expiring within 14 days, and initiates the rotation workflow.
- **Cost anomaly scale-down** — cloud cost spike triggers the agent to find over-provisioned resources and propose a scale-down.
- **Auth spike response** — login volume at 10x normal: the agent classifies campaign vs DDoS and routes to the appropriate runbook.
- **Noise suppression** — PagerDuty AIOps agent correlates 400 alerts into 3 actionable incidents. 91% reduction claimed.
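The core of the noise-suppression pattern is simple to sketch: group raw alerts by a fingerprint and open one incident per group. The fingerprint fields here (service plus failure type) are an illustrative choice, not any vendor's actual correlation logic:

```python
from collections import defaultdict

def correlate(alerts: list) -> dict:
    """Collapse raw alerts into incidents keyed by (service, failure_type)."""
    incidents = defaultdict(list)
    for alert in alerts:
        fingerprint = (alert["service"], alert["failure_type"])
        incidents[fingerprint].append(alert)
    return dict(incidents)

# 400 raw alerts across three underlying problems:
alerts = (
    [{"service": "api", "failure_type": "5xx"}] * 300
    + [{"service": "db", "failure_type": "latency"}] * 90
    + [{"service": "cache", "failure_type": "evictions"}] * 10
)
incidents = correlate(alerts)  # three incidents to page on, not 400 alerts
```

Real correlation engines add time windows and topology awareness on top, but the paging math is the same: the on-call sees one page per fingerprint.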
The risks nobody wants to put in the pitch deck
You just gave an LLM kubectl write access and a webhook trigger. Here is what the threat model looks like.
Prompt injection via alert payloads
An attacker crafts a pod name or service response that hijacks the agent's instructions mid-execution.
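One concrete mitigation: treat every field in the alert payload as untrusted, and validate it against the narrow grammar it should have before it reaches the prompt. A sketch for pod names, using a conservative lowercase-alphanumeric-and-hyphen pattern (Kubernetes names follow DNS naming rules; the validator itself is illustrative):

```python
import re

# Conservative pattern for pod names: lowercase alphanumerics and hyphens,
# starting and ending with an alphanumeric. Spaces, quotes, or prose in a
# "pod name" field are a red flag for prompt injection.
POD_NAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

def safe_pod_name(raw: str) -> str:
    """Reject payload fields that cannot be legitimate pod names."""
    if not POD_NAME_RE.match(raw):
        raise ValueError(f"suspicious pod name rejected: {raw!r}")
    return raw
```

Validation does not stop injection hidden inside log lines the agent later reads; those need separate handling, such as quoting retrieved content as untrusted data in the prompt.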
Over-privileged IAM
IBM research: 70% of orgs grant AI more access than equivalent humans. Those orgs see 4.5x more security incidents.
Destructive action blast radius
kubectl delete, terraform destroy, and misconfigured rollbacks can cascade. Circuit breakers are not optional.
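A minimal circuit breaker caps how many destructive actions the agent may take inside one incident and trips open after the cap, forcing escalation to a human. The class and its default limit are an illustrative sketch, not a specific product feature:

```python
class ActionCircuitBreaker:
    """Trip after N destructive actions in one incident; stay open until reset."""

    def __init__(self, max_destructive: int = 2):
        self.max_destructive = max_destructive
        self.count = 0
        self.open = False

    def allow(self, action: str, destructive: bool) -> bool:
        if self.open:
            return False  # breaker tripped: every further action escalates to a human
        if destructive:
            self.count += 1
            if self.count > self.max_destructive:
                self.open = True
                return False
        return True
```

Note that once tripped, the breaker blocks read-only actions too: if the agent has already taken two writes without resolving the incident, its model of the situation is suspect and a human should take over.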
What is your MTTR savings worth?
Vendor MTTR reduction claims range from 38% to 95%. The free ROI calculator lets SRE leads model their own team's numbers, no email required.
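The calculator's core arithmetic is simple enough to run yourself. A sketch with hypothetical inputs — your incident volume, MTTR, and loaded cost will differ:

```python
def mttr_savings(incidents_per_month: int, avg_mttr_minutes: float,
                 reduction_pct: float, engineer_cost_per_hour: float,
                 engineers_per_incident: int = 2) -> float:
    """Monthly value of engineer time recovered by an MTTR reduction."""
    minutes_saved = incidents_per_month * avg_mttr_minutes * (reduction_pct / 100)
    return (minutes_saved / 60) * engineer_cost_per_hour * engineers_per_incident

# Hypothetical team: 40 incidents/month, 45-minute MTTR, the conservative 38%
# end of the vendor claim range, $120/hr loaded cost, 2 engineers per incident:
monthly = mttr_savings(40, 45, 38, 120)  # 2736.0 dollars/month
```

Running the same numbers at the 95% end of the claim range shows why the spread matters for procurement: the answer changes by a factor of 2.5.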
Common questions
What is an agentic runbook?
An agentic runbook is a runbook executed by an AI agent that reasons over live signals, chooses actions from a defined tool scope, and learns from outcomes. The three defining properties are agency, memory, and tool scope. Unlike a scripted automated runbook, the agent is not following a fixed execution path. It applies judgment to the current state of the system.
Is PagerDuty's runbook automation really agentic?
Honest answer: partially. PagerDuty Runbook Automation (formerly Rundeck, $125/user/month) has deterministic execution at its core: event triggers a job, job runs predefined steps. The recent additions of Gen-AI job authoring and the AIOps event-correlation layer push it toward agentic behaviour, but the runbook execution itself remains deterministic. The AIOps layer is agentic-adjacent; the runbook runner is not.
How do you write an agentic runbook?
An agentic runbook needs eight fields: metadata (id, version, owner, risk, approvers), signal_spec (what triggers the agent), tool_scope (what APIs it can call), action_boundary (which actions require human approval), context_retrieval (what past incidents and docs the agent pulls via RAG), execution_plan (the LangGraph or AutoGen graph), observability (logs and reasoning dump), and a learning_loop (how outcomes feed back). The /writing-your-first-agentic-runbook page has three full working examples in YAML and LangGraph Python.
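As a pre-deployment sanity check, verifying that a runbook document carries all eight sections takes a few lines. The section names come from the answer above; the validator itself is an illustrative sketch:

```python
# The eight required top-level sections of an agentic runbook.
REQUIRED_SECTIONS = {
    "metadata", "signal_spec", "tool_scope", "action_boundary",
    "context_retrieval", "execution_plan", "observability", "learning_loop",
}

def missing_sections(runbook: dict) -> set:
    """Return the required top-level sections absent from a parsed runbook."""
    return REQUIRED_SECTIONS - runbook.keys()
```

Run it against the parsed YAML in CI so an incomplete runbook fails review rather than failing at 3am.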
What is MCP and why does it matter for runbooks?
Model Context Protocol (MCP) is an open standard, introduced by Anthropic, for connecting AI agents to tools and data sources. It standardises how an agent discovers and invokes capabilities. AWS Bedrock AgentCore wraps Kubernetes, logs, and metrics APIs as MCP tools, meaning a LangGraph or AutoGen agent can call kubectl, CloudWatch, and PagerDuty through a single interface. For runbooks, MCP simplifies integration and enables composable, vendor-neutral agent architectures.
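The pattern MCP standardises — discover the available tools, then invoke any of them through one uniform entry point — can be mocked in a few lines. This is not the MCP SDK, only an illustration of why a single interface across kubectl, CloudWatch, and PagerDuty simplifies agent code:

```python
# Mock of the discover-then-invoke pattern. Not the MCP SDK; every name
# here is illustrative.

class ToolServer:
    def __init__(self):
        self._tools = {}

    def register(self, name: str, description: str, fn):
        self._tools[name] = {"description": description, "fn": fn}

    def list_tools(self):
        # Discovery: the agent asks what capabilities exist.
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    def call_tool(self, name: str, **kwargs):
        # Invocation: one uniform entry point, whatever the backend.
        return self._tools[name]["fn"](**kwargs)

server = ToolServer()
server.register("get_pod_logs", "Fetch recent logs for a pod",
                lambda pod: f"logs for {pod}")
```

The agent-side code never changes when a backend is swapped: a new capability appears in `list_tools` and is callable through `call_tool` with no new integration work.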
Can an agentic runbook work in an air-gapped environment?
Yes, with caveats. Cloud-API LLMs like Claude Sonnet or GPT-4o are out unless your compliance allows egress. The viable paths are self-hosted models (Llama 3, Mixtral, or fine-tuned smaller models) with a local vector database and on-premise orchestration. Emerging option: domain-specific distilled models fine-tuned on runbook reasoning tasks, running entirely inside your VPC. Latency and capability will be lower than cloud LLMs, but the pattern is architecturally sound.
Will agentic runbooks still be relevant in five years?
The vocabulary may shift (as 'DevOps' became 'platform engineering'), but the underlying pattern is durable: AI-mediated operational reasoning sitting between observability signals and infrastructure action planes. The specific term 'agentic runbook' may not survive, but the category it describes will. The sites and practitioners who define the vocabulary now will carry that authority forward, regardless of what the term evolves into.
Explore the reference
What is an Agentic Runbook?
Precise definition, taxonomy, and the four distinguishing properties.
Traditional vs Agentic
Side-by-side comparison matrix and decision tree.
Compare 12 Vendors
Forensic capability matrix. No sponsored placements.
Write Your First
Three working examples in YAML and LangGraph Python.
Security Threat Model
Prompt injection, over-privileged IAM, blast radius, and mitigations.
For Kubernetes
The 10 most automated K8s incident patterns, with real tools.
For AWS
DevOps Agent, Bedrock AgentCore, MCP gateway, and IAM policy.
Postmortem Automation
AI-drafted postmortems: what they produce and where they fail.
Glossary (40 terms)
The SRE and agentic AI vocabulary, defined precisely.