Incident Report Generator

Automatically generate post-mortem reports from incident timelines and Slack conversations.

This workflow parses incident data into a timeline, identifies the root cause, calculates impact, proposes action items, and generates a structured post-mortem.
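The steps form a small dependency graph: the timeline is parsed first, root-cause analysis and impact calculation both build on it, and the action items and final report come last. As a rough illustration in plain TypeScript, independent of the `@relayplane/workflows` API, the order in which steps could run can be derived from the dependencies declared in the Implementation section. `executionLevels` is a hypothetical helper, and whether the library actually runs same-level steps concurrently is not stated here.

```typescript
// Step names and dependencies as declared in the Implementation section.
const stepDependencies: Record<string, string[]> = {
  "parse-timeline": [],
  "root-cause": ["parse-timeline"],
  "calculate-impact": ["parse-timeline"],
  "action-items": ["root-cause", "calculate-impact"],
  "write-postmortem": ["parse-timeline", "root-cause", "calculate-impact", "action-items"],
};

// Group steps into levels. Every step in a level depends only on steps from
// earlier levels, so steps that share a level do not depend on each other.
function executionLevels(deps: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const levels: string[][] = [];
  const total = Object.keys(deps).length;

  while (done.size < total) {
    const ready = Object.entries(deps)
      .filter(([step, needs]) => !done.has(step) && needs.every((d) => done.has(d)))
      .map(([step]) => step);

    if (ready.length === 0) {
      throw new Error("Dependency cycle or unknown step in a dependency list");
    }
    ready.forEach((step) => done.add(step));
    levels.push(ready);
  }
  return levels;
}

const stepLevels = executionLevels(stepDependencies);
console.log(stepLevels);
// [["parse-timeline"], ["root-cause", "calculate-impact"], ["action-items"], ["write-postmortem"]]
```

The second level shows that `root-cause` and `calculate-impact` are independent of each other, since both need only the parsed timeline.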

Implementation

import { relay } from "@relayplane/workflows";

const result = await relay
  .workflow("incident-report")

  // Step 1: Parse incident timeline from multiple sources
  .step("parse-timeline")
  .with("openai:gpt-4o")
  .prompt(`Parse this incident data into a structured timeline:

PagerDuty Events:
{{pagerdutyEvents}}

Slack #incidents Channel:
{{slackMessages}}

Monitoring Alerts:
{{datadogAlerts}}

Create chronological timeline with:
- Timestamp (ISO format)
- Event description
- Who took action
- System component affected

Return as JSON array sorted by time.`)

  // Step 2: Identify root cause
  .step("root-cause")
  .with("anthropic:claude-3.5-sonnet")
  .depends("parse-timeline")
  .prompt(`Analyze this incident timeline to identify root cause:

{{parse-timeline.output}}

Determine:
- Primary root cause (be specific)
- Contributing factors
- Which system/service failed
- Whether this was preventable
- Similar past incidents

Use the "5 Whys" technique.`)

  // Step 3: Calculate impact metrics
  .step("calculate-impact")
  .with("openai:gpt-4o")
  .depends("parse-timeline")
  .prompt(`Calculate incident impact from this timeline:

{{parse-timeline.output}}

Additional context:
- Detection time: {{detectionTime}}
- Resolution time: {{resolutionTime}}
- Affected users: {{affectedUsers}}
- Revenue impact: {{revenueImpact}}

Calculate:
- Total downtime (minutes)
- MTTR (Mean Time To Recovery)
- MTTD (Mean Time To Detect)
- Severity level (SEV0-SEV3)
- Customer impact score

Return as structured JSON.`)

  // Step 4: Generate action items
  .step("action-items")
  .with("anthropic:claude-3.5-sonnet")
  .depends("root-cause", "calculate-impact")
  .prompt(`Generate concrete action items to prevent recurrence:

Root Cause: {{root-cause.output}}
Impact: {{calculate-impact.output}}

Create action items with:
- Immediate fixes (0-7 days)
- Short-term improvements (1-4 weeks)
- Long-term prevention (1-3 months)

Each item needs:
- Description
- Owner (role, not person)
- Estimated effort
- Priority (P0-P2)

Be specific and actionable.`)

  // Step 5: Write formal post-mortem
  .step("write-postmortem")
  .with("anthropic:claude-3.5-sonnet")
  .depends("parse-timeline", "root-cause", "calculate-impact", "action-items")
  .prompt(`Write a blameless post-mortem report:

Timeline: {{parse-timeline.output}}
Root Cause: {{root-cause.output}}
Impact: {{calculate-impact.output}}
Action Items: {{action-items.output}}

Structure:
# Incident Summary
- Date and duration
- Severity
- Services affected
- Customer impact

# Timeline
- Detection
- Key events
- Resolution

# Root Cause Analysis
- What happened
- Why it happened
- Contributing factors

# Impact Assessment
- Metrics
- Customer effect
- Business impact

# What Went Well
- Positive aspects of response

# What Could Be Improved
- Areas for improvement

# Action Items
- Categorized by timeframe
- Owners assigned

# Lessons Learned

Use professional, blameless language.
Target audience: Engineering and leadership.`)

  .run({
    pagerdutyEvents: pdEvents,
    slackMessages: slackThread,
    datadogAlerts: ddAlerts,
    detectionTime: "2024-01-15T14:32:00Z",
    resolutionTime: "2024-01-15T16:18:00Z",
    affectedUsers: "~12,000",
    revenueImpact: "$8,500 estimated",
  });

// Save to wiki/docs
await saveToConfluence({
  space: "Engineering",
  title: `Post-Mortem: ${incidentTitle}`,
  content: result.steps["write-postmortem"].output,
});
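Durations are plain arithmetic on the timestamps passed to `run()`, so they can also be computed in code and used to cross-check the model's `calculate-impact` output. A minimal sketch, assuming ISO 8601 timestamps like those in the example; `minutesBetween` is a hypothetical helper, not part of the library.

```typescript
// Whole minutes between two ISO 8601 timestamps.
function minutesBetween(startIso: string, endIso: string): number {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error("Invalid ISO timestamp");
  }
  if (end < start) {
    throw new Error("End time precedes start time");
  }
  return Math.round((end - start) / 60000);
}

// Values from the run() call above: detected at 14:32Z, resolved at 16:18Z.
const detectionToResolution = minutesBetween(
  "2024-01-15T14:32:00Z",
  "2024-01-15T16:18:00Z",
);
console.log(detectionToResolution); // 106
```

This figure covers only the span from detection to resolution. Total downtime also includes the time before detection, which requires the incident's actual start time from the parsed timeline.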

Webhook Trigger

Auto-generate post-mortems when incidents are resolved:

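// PagerDuty webhook handler (Express). getSlackThread and getDatadogAlerts
// are application-defined helpers.
app.post("/webhooks/pagerduty", async (req, res) => {
  const event = req.body;

  // Acknowledge right away: the workflow makes several model calls and may
  // take longer than the webhook sender is willing to wait.
  res.sendStatus(200);

  if (event.event !== "incident.resolved") return;
  const incident = event.incident;

  try {
    // Fetch related data
    const slackThread = await getSlackThread(incident.id);
    const alerts = await getDatadogAlerts(incident.created_at, incident.resolved_at);

    // Generate post-mortem
    await relay
      .workflow("incident-report")
      .run({
        pagerdutyEvents: incident.log_entries,
        slackMessages: slackThread,
        datadogAlerts: alerts,
        detectionTime: incident.created_at,
        resolutionTime: incident.resolved_at,
        affectedUsers: incident.impacted_services.total_users,
        // Referenced by the calculate-impact prompt; this payload carries no estimate
        revenueImpact: "unknown",
      });
  } catch (err) {
    // The response has already been sent, so log instead of failing the request
    console.error("Post-mortem generation failed for incident", incident.id, err);
  }
});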
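The mapping from webhook payload to workflow inputs can be pulled into a small pure function, which makes it easier to test and ensures that every placeholder the prompts reference, including `{{revenueImpact}}`, receives a value. This is a sketch that assumes the payload shape used in the handler above (`event.event`, `incident.created_at`, and so on); the `WebhookEvent` and `RunInputs` types, `toRunInputs`, and the `"unknown"` fallbacks are hypothetical, and the real PagerDuty payload may differ by webhook version.

```typescript
// Payload shape as used by the handler above (an assumption, not a PagerDuty schema).
interface WebhookEvent {
  event: string;
  incident: {
    id: string;
    created_at: string;
    resolved_at: string;
    log_entries?: unknown[];
    impacted_services?: { total_users?: number };
  };
}

// One field per {{placeholder}} supplied through run().
interface RunInputs {
  pagerdutyEvents: unknown[];
  slackMessages: unknown;
  datadogAlerts: unknown;
  detectionTime: string;
  resolutionTime: string;
  affectedUsers: string;
  revenueImpact: string;
}

// Returns null for events that should not trigger a post-mortem.
function toRunInputs(
  event: WebhookEvent,
  slackMessages: unknown,
  datadogAlerts: unknown,
): RunInputs | null {
  if (event.event !== "incident.resolved") return null;

  const incident = event.incident;
  const users = incident.impacted_services?.total_users;
  return {
    pagerdutyEvents: incident.log_entries ?? [],
    slackMessages,
    datadogAlerts,
    detectionTime: incident.created_at,
    resolutionTime: incident.resolved_at,
    affectedUsers: users === undefined ? "unknown" : String(users),
    revenueImpact: "unknown", // no estimate is available in this payload
  };
}

// Usage with a made-up payload.
const sampleEvent: WebhookEvent = {
  event: "incident.resolved",
  incident: {
    id: "EXAMPLE1",
    created_at: "2024-01-15T14:32:00Z",
    resolved_at: "2024-01-15T16:18:00Z",
    impacted_services: { total_users: 12000 },
  },
};
const sampleInputs = toRunInputs(sampleEvent, [], []);
console.log(sampleInputs?.affectedUsers); // "12000"
```

Inside the handler, the result would be passed to `.run()` only when it is not `null`.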

Benefits

Time Savings: Post-mortem creation drops from 2-4 hours to 10 minutes
Consistency: All reports follow the same structured format
Completeness: Timeline events are merged from PagerDuty, Slack, and monitoring alerts, so critical ones are less likely to be missed
Blameless Culture: AI maintains a professional, learning-focused tone

Production Tip: Run this within 24 hours of incident resolution, while details are still fresh.
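One way to act on this tip is to flag resolved incidents that still lack a report once the 24-hour window has passed. A small sketch; `isReportOverdue` and `REPORT_WINDOW_MS` are hypothetical names.

```typescript
// 24 hours, expressed in milliseconds.
const REPORT_WINDOW_MS = 24 * 60 * 60 * 1000;

// True when more than 24 hours have passed since the incident was resolved.
function isReportOverdue(resolvedAtIso: string, now: Date = new Date()): boolean {
  const resolvedAt = Date.parse(resolvedAtIso);
  if (Number.isNaN(resolvedAt)) {
    throw new Error("Invalid resolution timestamp");
  }
  return now.getTime() - resolvedAt > REPORT_WINDOW_MS;
}

// Resolved at 16:18Z on Jan 15: still inside the window the next morning,
// overdue by the next evening.
const nextMorning = isReportOverdue("2024-01-15T16:18:00Z", new Date("2024-01-16T10:00:00Z"));
const nextEvening = isReportOverdue("2024-01-15T16:18:00Z", new Date("2024-01-16T17:00:00Z"));
console.log(nextMorning, nextEvening); // false true
```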