Incident Report Generator

Automatically generate post-mortem reports from incident timelines and Slack conversations.

This workflow analyzes incident data, creates timelines, identifies root causes, and generates structured post-mortems.
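
As a rough mental model, the workflow is a small dependency graph: each step names the steps whose output it needs. The sketch below is independent of the RelayPlane API. The step names mirror those in the Implementation section, `executionLevels` is a hypothetical helper, and whether the runtime actually executes same-level steps concurrently is an assumption, not something this page states.

```typescript
// Step names and dependencies mirror the workflow in the Implementation section.
const dependsOn: Record<string, string[]> = {
  "parse-timeline": [],
  "root-cause": ["parse-timeline"],
  "calculate-impact": ["parse-timeline"],
  "action-items": ["root-cause", "calculate-impact"],
  "write-postmortem": ["parse-timeline", "root-cause", "calculate-impact", "action-items"],
};

// Hypothetical helper: group steps into levels so that every step's
// dependencies sit in an earlier level. Steps within one level do not
// depend on each other.
function executionLevels(graph: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const levels: string[][] = [];
  let remaining = Object.keys(graph);

  while (remaining.length > 0) {
    const ready = remaining.filter((step) =>
      (graph[step] ?? []).every((dep) => done.has(dep)),
    );
    if (ready.length === 0) {
      throw new Error(`Cycle or missing dependency among: ${remaining.join(", ")}`);
    }
    levels.push(ready);
    for (const step of ready) done.add(step);
    remaining = remaining.filter((step) => !done.has(step));
  }
  return levels;
}

const levels = executionLevels(dependsOn);
console.log(levels);
// [["parse-timeline"], ["root-cause", "calculate-impact"], ["action-items"], ["write-postmortem"]]
```

Read this way, root-cause and calculate-impact are the only two steps that could overlap; everything else is strictly sequential.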

Implementation

import { relay } from "@relayplane/workflows";

// pdEvents, slackThread, ddAlerts, incidentTitle, and saveToConfluence are
// assumed to be defined elsewhere in your codebase.

const result = await relay
  .workflow("incident-report")

  // Step 1: Parse incident timeline from multiple sources
  .step("parse-timeline")
  .with("openai:gpt-4o")
  .prompt(`Parse this incident data into a structured timeline:

PagerDuty Events:
{{pagerdutyEvents}}

Slack #incidents Channel:
{{slackMessages}}

Monitoring Alerts:
{{datadogAlerts}}

Create chronological timeline with:
- Timestamp (ISO format)
- Event description
- Who took action
- System component affected

Return as JSON array sorted by time.`)

  // Step 2: Identify root cause
  .step("root-cause")
  .with("anthropic:claude-3.5-sonnet")
  .depends("parse-timeline")
  .prompt(`Analyze this incident timeline to identify root cause:

{{parse-timeline.output}}

Determine:
- Primary root cause (be specific)
- Contributing factors
- Which system/service failed
- Whether this was preventable
- Similar past incidents

Use the "5 Whys" technique.`)

  // Step 3: Calculate impact metrics
  .step("calculate-impact")
  .with("openai:gpt-4o")
  .depends("parse-timeline")
  .prompt(`Calculate incident impact from this timeline:

{{parse-timeline.output}}

Additional context:
- Detection time: {{detectionTime}}
- Resolution time: {{resolutionTime}}
- Affected users: {{affectedUsers}}
- Revenue impact: {{revenueImpact}}

Calculate:
- Total downtime (minutes)
- MTTR (Mean Time To Recovery)
- MTTD (Mean Time To Detect)
- Severity level (SEV0-SEV3)
- Customer impact score

Return as structured JSON.`)

  // Step 4: Generate action items
  .step("action-items")
  .with("anthropic:claude-3.5-sonnet")
  .depends("root-cause", "calculate-impact")
  .prompt(`Generate concrete action items to prevent recurrence:

Root Cause: {{root-cause.output}}
Impact: {{calculate-impact.output}}

Create action items with:
- Immediate fixes (0-7 days)
- Short-term improvements (1-4 weeks)
- Long-term prevention (1-3 months)

Each item needs:
- Description
- Owner (role, not person)
- Estimated effort
- Priority (P0-P2)

Be specific and actionable.`)

  // Step 5: Write formal post-mortem
  .step("write-postmortem")
  .with("anthropic:claude-3.5-sonnet")
  .depends("parse-timeline", "root-cause", "calculate-impact", "action-items")
  .prompt(`Write a blameless post-mortem report:

Timeline: {{parse-timeline.output}}
Root Cause: {{root-cause.output}}
Impact: {{calculate-impact.output}}
Action Items: {{action-items.output}}

Structure:
# Incident Summary
- Date and duration
- Severity
- Services affected
- Customer impact

# Timeline
- Detection
- Key events
- Resolution

# Root Cause Analysis
- What happened
- Why it happened
- Contributing factors

# Impact Assessment
- Metrics
- Customer effect
- Business impact

# What Went Well
- Positive aspects of response

# What Could Be Improved
- Areas for improvement

# Action Items
- Categorized by timeframe
- Owners assigned

# Lessons Learned

Use professional, blameless language.
Target audience: Engineering and leadership.`)

  // Inputs fill the {{...}} placeholders used in the prompts above
  .run({
    pagerdutyEvents: pdEvents,
    slackMessages: slackThread,
    datadogAlerts: ddAlerts,
    detectionTime: "2024-01-15T14:32:00Z",
    resolutionTime: "2024-01-15T16:18:00Z",
    affectedUsers: "~12,000",
    revenueImpact: "$8,500 estimated",
  });

// Save to wiki/docs
await saveToConfluence({
  space: "Engineering",
  title: `Post-Mortem: ${incidentTitle}`,
  content: result.steps["write-postmortem"].output,
});
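
The calculate-impact step asks a language model to do date arithmetic, which is worth cross-checking deterministically. The following is a minimal sketch that reuses the detectionTime and resolutionTime values from the run() call above. `minutesBetween` and `downtimeLooksConsistent` are hypothetical helper names, and because the shape of the model's JSON output is not specified here, the check takes a plain number rather than parsing that output.

```typescript
// Hypothetical helper: whole minutes between two ISO 8601 timestamps.
function minutesBetween(startIso: string, endIso: string): number {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error("Timestamps must be valid ISO 8601 strings");
  }
  if (end < start) {
    throw new Error("Resolution time is earlier than detection time");
  }
  return Math.round((end - start) / (60 * 1000));
}

// Hypothetical check: does the model-reported figure agree with the timestamps?
function downtimeLooksConsistent(
  reportedMinutes: number,
  expectedMinutes: number,
  toleranceMinutes = 1,
): boolean {
  return Math.abs(reportedMinutes - expectedMinutes) <= toleranceMinutes;
}

// Same timestamps as the run() call above.
const detectedToResolved = minutesBetween(
  "2024-01-15T14:32:00Z",
  "2024-01-15T16:18:00Z",
);
console.log(detectedToResolved); // 106
console.log(downtimeLooksConsistent(106, detectedToResolved)); // true
console.log(downtimeLooksConsistent(160, detectedToResolved)); // false
```

Note that this measures detection to resolution. If the outage began before it was detected, total downtime is longer, so the computed value is best treated as a lower bound when reviewing the model's answer.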

Webhook Trigger

Auto-generate post-mortems when incidents are resolved:

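// PagerDuty webhook handler
// Assumes an Express `app` with JSON body parsing, and that getSlackThread
// and getDatadogAlerts are defined elsewhere. The payload field names follow
// this example; check them against the webhook version you have configured.
app.post("/webhooks/pagerduty", async (req, res) => {
  const event = req.body;

  try {
    if (event.event === "incident.resolved") {
      const incident = event.incident;

      // Fetch related data
      const slackThread = await getSlackThread(incident.id);
      const alerts = await getDatadogAlerts(incident.created_at, incident.resolved_at);

      // Generate post-mortem
      await relay
        .workflow("incident-report")
        .run({
          pagerdutyEvents: incident.log_entries,
          slackMessages: slackThread,
          datadogAlerts: alerts,
          detectionTime: incident.created_at,
          resolutionTime: incident.resolved_at,
          affectedUsers: incident.impacted_services.total_users,
          // Not available from the webhook; the calculate-impact prompt expects a value
          revenueImpact: "unknown",
        });
    }

    res.sendStatus(200);
  } catch (err) {
    // Without this, a rejected promise would leave the request unanswered
    console.error("Post-mortem generation failed:", err);
    res.sendStatus(500);
  }
});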
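
Webhook senders commonly retry a delivery when the receiver is slow or returns an error, and the handler above awaits a multi-step LLM workflow before responding. A small guard can keep a retried incident.resolved event from producing a second post-mortem. This is a minimal sketch that assumes an in-memory Set is acceptable, which holds for a single process only; a multi-instance deployment would need a shared store. `IncidentDeduper` is a hypothetical name, not part of RelayPlane or PagerDuty.

```typescript
// Hypothetical in-memory guard against processing the same incident twice.
class IncidentDeduper {
  private readonly seen = new Set<string>();

  // Returns true the first time an incident ID is seen, false afterwards.
  firstSeen(incidentId: string): boolean {
    if (this.seen.has(incidentId)) {
      return false;
    }
    this.seen.add(incidentId);
    return true;
  }
}

const deduper = new IncidentDeduper();
console.log(deduper.firstSeen("INC-1")); // true  (generate the post-mortem)
console.log(deduper.firstSeen("INC-1")); // false (retry; skip generation)
console.log(deduper.firstSeen("INC-2")); // true  (a different incident)
```

In the handler, the check would sit before the Slack and Datadog lookups: when firstSeen(incident.id) returns false, respond with 200 and return without running the workflow.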

Benefits

  • Time Savings: Drafting a post-mortem drops from 2-4 hours to about 10 minutes
  • Consistency: All reports follow the same structured format
  • Completeness: PagerDuty, Slack, and monitoring events are merged into one timeline, so critical events are less likely to be missed
  • Blameless Culture: The prompt asks for professional, blameless language, which keeps the tone learning-focused

Production Tip: Run this within 24 hours of incident resolution, while details are fresh.
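
The consistency benefit depends on the model actually emitting every section the write-postmortem prompt asks for, and that can be spot-checked before a report is published. The sketch below assumes the report comes back as Markdown with "#"-style headings, as the prompt's Structure list suggests. `missingSections` is a hypothetical helper.

```typescript
// Headings taken from the "Structure:" list in the write-postmortem prompt.
const REQUIRED_SECTIONS: string[] = [
  "Incident Summary",
  "Timeline",
  "Root Cause Analysis",
  "Impact Assessment",
  "What Went Well",
  "What Could Be Improved",
  "Action Items",
  "Lessons Learned",
];

// Hypothetical helper: list required headings absent from a generated report.
// Matching is case-insensitive and accepts any Markdown heading level.
function missingSections(report: string): string[] {
  const headings = new Set<string>();
  for (const line of report.split("\n")) {
    const match = /^#{1,6}\s+(.+?)\s*$/.exec(line);
    if (match && match[1]) {
      headings.add(match[1].toLowerCase());
    }
  }
  return REQUIRED_SECTIONS.filter((name) => !headings.has(name.toLowerCase()));
}

const draft = [
  "# Incident Summary",
  "- Severity: SEV2",
  "# Timeline",
  "# Root Cause Analysis",
  "# Impact Assessment",
  "# What Went Well",
  "# What Could Be Improved",
  "# Action Items",
].join("\n");

console.log(missingSections(draft)); // ["Lessons Learned"]
```

A non-empty result is a reasonable signal to rerun the final step or send the draft to a human editor instead of saving it directly.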