AI-Augmented SRE Workflows
Built reusable Claude/Codex skill files at GoGuardian that automated vulnerability analysis and DDoS alert investigation — cutting 60+ minutes of manual security analysis to under 5 minutes per run.
Two recurring tasks were consuming hours of SRE time each week: manually reviewing Prisma Cloud vulnerability reports and investigating rate-limiting alerts by querying CloudFront and ALB logs. Both were analytical, multi-step, and followed a consistent logic — good candidates for automation. But neither fit neatly into a traditional script.
Why Skill Files, Not Scripts
Traditional scripts work well when the steps are deterministic. Vulnerability analysis isn’t: different CVE categories need different remediation paths, reports vary in format, and the output needs to be readable by a non-expert. A script would need hundreds of conditional branches and would still miss edge cases.
Skill files are structured instruction sets for AI CLI agents (Claude/Codex). Instead of encoding every branch, the skill describes the domain logic: what to look for, how to categorize, what format to output. The agent reasons through the specifics.
The Two Skills
Vulnerability Analysis Skill
- Input: Prisma Cloud vulnerability report for the EC2 fleet
- Agent categorizes vulnerabilities by severity, determines fix method (`yum upgrade` vs manual patching), identifies packages involved
- Output: structured summary of P0/P1 remaining, CVEs fixable via `yum upgrade`, packages per CVE
- Time: 60 minutes manual → under 2 minutes
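To make the shape of that summary concrete, here is a minimal Python sketch of the grouping the skill's output reflects. It is not the skill itself (the skill is an instruction file the agent reasons over), and everything in it is an assumption for illustration: the `Finding` fields, the `summarize` helper, the rule that a CVE counts as fixable via `yum upgrade` only when every affected package has a packaged fix, and the placeholder CVE IDs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One row of a vulnerability report (hypothetical field names)."""
    cve: str
    severity: str                # assumed labels such as "P0", "P1", "P2"
    package: str
    fix_version: Optional[str]   # None when no packaged fix is known


def summarize(findings):
    """Group findings into the three views described in the list above."""
    packages_per_cve = defaultdict(set)
    needs_manual = set()   # CVEs with at least one package lacking a packaged fix
    urgent = set()         # CVEs rated P0 or P1

    for f in findings:
        packages_per_cve[f.cve].add(f.package)
        if f.fix_version is None:
            needs_manual.add(f.cve)
        if f.severity in ("P0", "P1"):
            urgent.add(f.cve)

    all_cves = set(packages_per_cve)
    return {
        "urgent_remaining": sorted(urgent),
        "fixable_via_yum_upgrade": sorted(all_cves - needs_manual),
        "needs_manual_patching": sorted(needs_manual),
        "packages_per_cve": {
            cve: sorted(pkgs) for cve, pkgs in sorted(packages_per_cve.items())
        },
    }


# Placeholder data: these CVE IDs and versions are made up for the example.
report = [
    Finding("CVE-0000-0001", "P0", "openssl", "1.0"),
    Finding("CVE-0000-0001", "P0", "openssl-libs", "1.0"),
    Finding("CVE-0000-0002", "P2", "kernel", None),
]
summary = summarize(report)
```

The deterministic grouping is the easy part. What the skill adds on top is handling reports that vary in format and writing the result up for a non-expert reader.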
DDoS Alert Investigation Skill
- Input: PagerDuty alert with rate-limiting context
- Agent queries CloudFront logs (via Athena), correlates with ALB logs, identifies source IP, targeted endpoint, user details, attack pattern
- Output: structured investigation report — what happened, who, from where, what was blocked
- Time: 30+ minutes manual → under 5 minutes
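The correlation step can be pictured as a small aggregation over log rows that have already been fetched. The Python sketch below is illustrative only. The record keys (`client_ip`, `uri`, `status`), the `top_offenders` helper, and the choice to treat HTTP 403 and 429 as "blocked" are all assumptions, not the real CloudFront or ALB log schema or the skill's actual logic. The IP addresses come from the reserved documentation ranges.

```python
from collections import Counter


def top_offenders(records, blocked_statuses=(403, 429), limit=3):
    """Rank (source IP, endpoint) pairs by request volume.

    `records` is an iterable of dicts with the assumed keys
    client_ip, uri and status.
    """
    requests = Counter()
    blocked = Counter()
    for r in records:
        key = (r["client_ip"], r["uri"])
        requests[key] += 1
        if r["status"] in blocked_statuses:
            blocked[key] += 1

    return [
        {"client_ip": ip, "uri": uri, "requests": n, "blocked": blocked[(ip, uri)]}
        for (ip, uri), n in requests.most_common(limit)
    ]


# Made-up sample rows standing in for query results.
sample = (
    [{"client_ip": "203.0.113.7", "uri": "/login", "status": 429}] * 3
    + [{"client_ip": "203.0.113.7", "uri": "/login", "status": 200}] * 2
    + [{"client_ip": "198.51.100.2", "uri": "/home", "status": 200}]
)
worst = top_offenders(sample, limit=1)
```

In this sample, `worst` holds a single entry for `203.0.113.7` hitting `/login`, with 5 requests of which 3 were blocked. That maps onto the "who, from where, what was blocked" parts of the report.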
Team Distribution
Packaged both skills in a shared GitHub repo with a single npx install command. Any engineer can run the investigation without understanding log schemas or Athena query syntax — the skill handles the context.
Results
| Metric | Before | After |
|---|---|---|
| Vulnerability analysis time | ~60 min per report | ~2 min |
| Alert investigation time | 30+ min, inconsistent | ~5 min, structured output |
| Who can investigate | SRE only | Any engineer (self-serve) |
| Process consistency | Variable | Uniform output format |
The broader outcome: this established a pattern. Before writing a new Python script for a repetitive task, we now ask whether a skill file handles it better. Skills compose; scripts accumulate.