AI as an SRE Tool: Beyond the Hype

Most AI-in-DevOps content is either breathless hype (“AI will replace SREs!”) or dismissive (“it just hallucinates”). Neither matches my experience using AI as a daily SRE tool. Here’s what works, what doesn’t, and the specific architecture that made AI useful for us.

The Problem We Were Solving

Two recurring tasks were eating hours each week:

Vulnerability analysis: Prisma Cloud generates reports showing security vulnerabilities across our EC2 fleet. For each CVE, you need to determine severity, affected packages, whether it’s fixable via yum upgrade or requires manual intervention, and produce a summary that’s actionable for the team. Start to finish: about an hour per report.

DDoS alert investigation: When a rate-limiting alert fires, someone needs to query CloudFront logs in Athena, correlate with ALB logs, identify the source IP and targeted endpoint, and produce a summary of what happened. Typically 30+ minutes, inconsistent output depending on who ran it.

Both tasks shared a structure: multi-step analysis, variable input formats, reasoning-heavy output. Classic candidates for automation — except writing a traditional Python script for either would require hundreds of conditional branches and would still break whenever the input format changed slightly.

Why Scripts Weren’t the Right Tool

Here’s the problem with scripting these tasks: the logic doesn’t reduce to a small, stable set of rules.

A vulnerability analysis script needs to handle: different CVE categories with different remediation approaches, Prisma report format variations, edge cases like CVEs with no available fix, packages that appear multiple times across different CVEs. You can write that in Python. You’ll spend a week on it, it’ll be 500 lines, and it’ll break the next time Prisma changes their output format.

The human doing this task doesn’t use 500 conditional branches. They use judgment: “this is a kernel CVE, it needs a reboot; this is a library CVE, yum upgrade handles it; this one has no patch yet, log it and move on.” That’s pattern recognition over variable input — exactly what language models are good at.
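To make the contrast concrete, here is a minimal sketch of how the scripted version starts. The `Finding` fields and the package-name check are assumptions for illustration, not Prisma’s actual report schema. Three rules look tidy; it’s the format variations and edge cases that grow this into hundreds of lines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    # Hypothetical fields; a real Prisma report has its own schema.
    cve_id: str
    package: str
    fixed_version: Optional[str]  # None when no patch is available yet


def remediation(finding: Finding) -> str:
    """Map one finding to the three judgment calls described above."""
    if finding.fixed_version is None:
        return "no patch yet: log it and move on"
    # Assumption: kernel packages can be recognized by name prefix.
    if finding.package.startswith("kernel"):
        return "kernel CVE: needs a reboot"
    return "library CVE: yum upgrade"


if __name__ == "__main__":
    # Placeholder CVE ID and version, not real data.
    print(remediation(Finding("CVE-0000-0001", "openssl-libs", "1.2.3")))
```

Every new edge case (a package that appears under several CVEs, a report field that gets renamed) means another branch here, which is the maintenance cost the skill file approach avoids.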

The Skill File Approach

We use Claude and Codex via CLI agents. A “skill file” is a structured markdown document that tells the agent how to approach a specific task:

Domain context (what is Prisma Cloud, what does the report format look like)
Analysis logic (how to categorize CVE types, what remediation path each implies)
Output format (what the summary should contain, how it should be structured)

The agent reads the skill file, reads the input (report or alert), reasons through the analysis steps, and produces structured output.
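As a rough illustration of that three-part structure, the sketch below assembles a skill file skeleton as markdown. The heading names, the `render_skill` helper, and the sample text are all assumptions for illustration; neither Claude nor Codex requires this exact layout.

```python
def render_skill(name: str, context: str, logic: list[str], output_fields: list[str]) -> str:
    """Build a markdown skill file with the three sections described above."""
    lines = [f"# Skill: {name}", "", "## Domain context", context, "", "## Analysis logic"]
    # Numbered steps, because the order of analysis matters to the agent.
    lines += [f"{i}. {step}" for i, step in enumerate(logic, start=1)]
    lines += ["", "## Output format"]
    lines += [f"- {field}" for field in output_fields]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_skill(
        name="vulnerability-analysis",  # hypothetical skill name
        context="Prisma Cloud reports list CVEs found across the EC2 fleet.",
        logic=[
            "Categorize each CVE by type.",
            "Decide whether yum upgrade fixes it or manual work is needed.",
        ],
        output_fields=["Severity", "Affected packages", "Remediation path"],
    ))
```

In practice the file is written by hand; the point of the sketch is only that the three sections stay separate, so the domain context can change without touching the analysis steps.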

For vulnerability analysis, the skill file describes the Prisma report schema, the CVE severity taxonomy we use internally, the distinction between CVEs that yum upgrade can fix and those that need manual intervention, and the output format the team expects. The agent handles the variable specifics of each report.

The result: 60 minutes → under 2 minutes, with the summary in the same structure every time.

For alert investigation, the skill tells the agent how to query Athena for CloudFront logs, what correlations to look for in ALB logs, how to identify attack patterns vs. legitimate traffic spikes, and what the investigation report should contain. 30+ minutes → under 5 minutes, with more complete output than the manual version.
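The core of that investigation can be sketched in a few lines, assuming the Athena results have already been fetched and parsed into dictionaries. The field names (`client_ip`, `uri`, `status`) and the choice to match ALB rows on request path are assumptions; how the two log sources are joined in practice depends on your setup.

```python
from collections import Counter


def summarize(cloudfront_rows, alb_rows):
    """Find the busiest source IP, its main target, and how the ALB responded."""
    by_ip = Counter(row["client_ip"] for row in cloudfront_rows)
    if not by_ip:
        return None
    top_ip, hits = by_ip.most_common(1)[0]

    # Which endpoint did that source hit most often?
    paths = Counter(row["uri"] for row in cloudfront_rows if row["client_ip"] == top_ip)
    endpoint = paths.most_common(1)[0][0]

    # Assumption: ALB rows are correlated by request path.
    statuses = Counter(row["status"] for row in alb_rows if row["uri"] == endpoint)
    return {
        "source_ip": top_ip,
        "requests": hits,
        "endpoint": endpoint,
        "alb_statuses": dict(statuses),
    }


if __name__ == "__main__":
    # Made-up rows using documentation-range IP addresses.
    cf = [{"client_ip": "203.0.113.7", "uri": "/login"}] * 3
    cf.append({"client_ip": "198.51.100.2", "uri": "/"})
    alb = [{"uri": "/login", "status": 429}, {"uri": "/", "status": 200}]
    print(summarize(cf, alb))
```

This covers only the mechanical part. Telling an attack from a legitimate traffic spike is the judgment step the skill file delegates to the agent.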

What Makes This Work (And When It Doesn’t)

It works well when:

The input format is semi-structured but variable (JSON with optional fields, logs with parsing edge cases)
The analysis requires reasoning across multiple inputs (correlating two log sources)
The output needs to be human-readable and actionable, not just data transformation
The task happens regularly enough to justify encoding the logic, but not frequently enough to justify a full pipeline

It doesn’t work well when:

You need guaranteed deterministic output (use a script)
The task has strict latency requirements (LLM API calls have variable latency)
The input contains sensitive data you can’t send to an external API (use self-hosted models or on-prem tooling)
The logic is actually simple and a script would be cleaner (don’t over-engineer)
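The exclusion criteria above can be written down as a small checklist. This is a sketch of the decision, not a tool we run; the `Task` fields and the `recommend` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical flags mirroring the "doesn't work well when" list.
    needs_deterministic_output: bool
    latency_sensitive: bool
    data_can_leave_network: bool
    logic_is_simple: bool


def recommend(task: Task) -> str:
    """Return the tool the criteria above point to."""
    if task.needs_deterministic_output or task.latency_sensitive or task.logic_is_simple:
        return "script"
    if not task.data_can_leave_network:
        return "self-hosted model or on-prem tooling"
    return "skill file"


if __name__ == "__main__":
    report_triage = Task(
        needs_deterministic_output=False,
        latency_sensitive=False,
        data_can_leave_network=True,
        logic_is_simple=False,
    )
    print(recommend(report_triage))
```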

We use skill files for human-in-the-loop workflows where the output is reviewed before action. We still use Python scripts for fully automated pipelines where the output directly triggers system changes.

Team Distribution

The skill files live in a shared GitHub repo. Any engineer can install them with a single npx command, with no CLI expertise and no knowledge of the underlying log schema required. They run the skill, get the investigation report, and escalate or close based on the output.

This is the real leverage: not that I can analyze a vulnerability report in 2 minutes instead of 60, but that anyone on the team can do it without needing to know Athena query syntax or Prisma’s CVE output format. Self-service investigation removes the SRE as a bottleneck.

The Broader Pattern

After building these two skills, we applied the same evaluation framework to other repetitive tasks: “Is this task analytical reasoning over variable input? If yes, consider a skill file before writing a script.”

Not everything qualifies. But for the tasks that do, the combination of reduced time-to-output and improved accessibility to non-experts is significant.

AI isn’t replacing SRE work. It’s making specific classes of repetitive analytical work fast enough to not be the bottleneck.

If you’re exploring this approach: start with a task you do manually and frequently, where the analysis logic is consistent but the input varies. Write the skill file by explaining your own thought process: what you look for, in what order, and what you do with each finding. The agent can then follow the same steps, and a human still reviews the output before acting on it.