AI-Augmented SRE Workflows

Built reusable Claude/Codex skill files at GoGuardian that automated vulnerability analysis and DDoS alert investigation — cutting 60+ minutes of manual security analysis to under 5 minutes per run.

Tech Stack
AI/LLM, Claude, Codex, Python, AWS Athena, Datadog

Two recurring tasks were consuming hours of SRE time each week: manually reviewing Prisma Cloud vulnerability reports and investigating rate-limiting alerts by querying CloudFront and ALB logs. Both were analytical, multi-step, and followed a consistent logic — good candidates for automation. But neither fit neatly into a traditional script.

Why Skill Files, Not Scripts

Traditional scripts work well when the steps are deterministic. Vulnerability analysis isn’t: different CVE categories need different remediation paths, reports vary in format, and the output needs to be readable by a non-expert. A script would need hundreds of conditional branches and would still miss edge cases.

Skill files are structured instruction sets for AI CLI agents (Claude/Codex). Instead of encoding every branch, the skill describes the domain logic: what to look for, how to categorize, what format to output. The agent reasons through the specifics.

The Two Skills

Vulnerability Analysis Skill

  • Input: Prisma Cloud vulnerability report for the EC2 fleet
  • Agent categorizes vulnerabilities by severity, determines fix method (yum upgrade vs manual patching), identifies packages involved
  • Output: structured summary of P0/P1 remaining, CVEs fixable via yum upgrade, packages per CVE
  • Time: 60 minutes manual → under 2 minutes
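To make the output shape concrete, here is a minimal Python sketch of the three summary sections listed above. It is not the skill itself: in the workflow described here the agent decides severity and fix method, and this only models what the resulting summary could look like. The `Finding` fields and the `summarize` helper are hypothetical names, not part of Prisma Cloud's report format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One classified report row. Field names are assumptions for illustration."""
    cve_id: str
    severity: str      # e.g. "P0", "P1", "P2"
    package: str
    yum_fixable: bool  # True if a yum upgrade resolves this package


def summarize(findings):
    """Build the three summary sections: remaining P0/P1, yum-fixable CVEs, packages per CVE."""
    packages_per_cve = defaultdict(set)
    remaining = defaultdict(set)
    needs_manual_patch = set()

    for f in findings:
        packages_per_cve[f.cve_id].add(f.package)
        if f.severity in ("P0", "P1"):
            remaining[f.severity].add(f.cve_id)
        if not f.yum_fixable:
            needs_manual_patch.add(f.cve_id)

    # A CVE counts as yum-fixable only if every affected package is.
    yum_fixable = set(packages_per_cve) - needs_manual_patch

    return {
        "remaining": {sev: sorted(ids) for sev, ids in remaining.items()},
        "yum_fixable": sorted(yum_fixable),
        "packages_per_cve": {c: sorted(p) for c, p in packages_per_cve.items()},
    }
```

Pinning the output to a fixed shape like this is one way to keep reports comparable from run to run, even when the input report format varies.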

DDoS Alert Investigation Skill

  • Input: PagerDuty alert with rate-limiting context
  • Agent queries CloudFront logs (via Athena), correlates with ALB logs, identifies source IP, targeted endpoint, user details, attack pattern
  • Output: structured investigation report — what happened, who, from where, what was blocked
  • Time: 30+ minutes manual → under 5 minutes
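As an illustration of the log-query step, the sketch below builds the kind of Athena SQL an investigation might start from: the top source IPs hitting an endpoint on a given day. Everything schema-related is an assumption: the table name passed in, the `request_ip`, `uri`, `status`, and `date` columns, and the choice of 403/429 as "blocked" status codes all depend on how the CloudFront log table was actually defined. The function name `build_top_talkers_query` is hypothetical.

```python
import re


def build_top_talkers_query(table, log_date, uri_prefix, limit=20):
    """Return Athena SQL ranking source IPs by request count for one endpoint and day.

    Column names (request_ip, uri, status, "date") are assumed, not taken
    from the original project; adjust them to the real table definition.
    """
    # Identifiers cannot be parameterized, so validate them strictly.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_.]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", log_date):
        raise ValueError(f"expected YYYY-MM-DD, got: {log_date!r}")

    # Escape single quotes so the prefix stays inside its string literal.
    safe_prefix = uri_prefix.replace("'", "''")

    return (
        "SELECT request_ip, uri,\n"
        "       COUNT(*) AS requests,\n"
        "       COUNT_IF(status IN (403, 429)) AS blocked\n"
        f"FROM {table}\n"
        f"WHERE \"date\" = DATE '{log_date}'\n"
        f"  AND strpos(uri, '{safe_prefix}') = 1\n"
        "GROUP BY request_ip, uri\n"
        "ORDER BY requests DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(build_top_talkers_query("cloudfront_logs", "2024-01-15", "/api/login"))
```

The resulting string could then be submitted through boto3's Athena client (`start_query_execution`), with the same source IPs looked up in the ALB logs to correlate the two. The point of the skill is that an engineer never has to write this by hand; the sketch only shows what the agent is composing on their behalf.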

Team Distribution

Packaged both skills in a shared GitHub repo with a single npx install command. Any engineer can run the investigation without understanding log schemas or Athena query syntax — the skill handles the context.

Results

Metric | Before | After
Vulnerability analysis time | ~60 min per report | ~2 min
Alert investigation time | 30+ min, inconsistent | ~5 min, structured output
Who can investigate | SRE only | Any engineer (self-serve)
Process consistency | Variable | Uniform output format

The broader outcome: this established a pattern. Before writing a new Python script for a repetitive task, we now ask whether a skill file handles it better. Skills compose; scripts accumulate.