# Log Analyzer

Automatically analyze application logs to detect errors, anomalies, and performance issues.

This workflow parses logs, identifies patterns, correlates errors, and generates actionable insights.

## Implementation

```typescript
import { relay } from "@relayplane/workflows";

// `logs` holds the raw log text to analyze, collected elsewhere
// (e.g. read from a file or fetched from a log aggregator).
const result = await relay
  .workflow("log-analyzer")

  // Step 1: Parse and categorize log entries
  .step("parse-logs")
  .with("openai:gpt-4o")
  .prompt(`Parse and categorize these application logs:

{{logEntries}}

For each unique error or pattern:
- Error type/category
- Frequency count
- First occurrence timestamp
- Severity (critical/error/warning/info)
- Affected service/component
- Common error messages

Return a structured summary grouped by error type.`)

  // Step 2: Identify root causes
  .step("root-cause")
  .with("anthropic:claude-3.5-sonnet")
  .depends("parse-logs")
  .prompt(`Analyze the error patterns for root causes:

Parsed Logs: {{parse-logs.output}}

For the top 5 most frequent errors:
- Likely root cause
- Evidence from log patterns
- Related errors (correlation)
- Time patterns (spike times, recurring intervals)
- Impacted user flows

Use stack traces and error messages to infer causation.`)

  // Step 3: Detect anomalies
  .step("detect-anomalies")
  .with("openai:gpt-4o")
  .depends("parse-logs")
  .prompt(`Detect anomalies in log patterns:

Log Summary: {{parse-logs.output}}
Baseline Stats: {{baselineStats}}

Identify:
- Error rate spikes (>2σ from baseline)
- New error types (not seen in baseline)
- Performance degradation (slow queries, timeouts)
- Unusual traffic patterns
- Failed authentication spikes

For each anomaly:
- What changed
- When it started
- Potential impact
- Urgency level`)

  // Step 4: Extract performance insights
  .step("performance-analysis")
  .with("anthropic:claude-3.5-sonnet")
  .depends("parse-logs")
  .prompt(`Analyze performance from logs:

{{logEntries}}

Extract metrics:
- Average response times by endpoint
- Slow queries (>1s)
- Database connection pool exhaustion
- Memory usage warnings
- Cache hit/miss rates

Identify bottlenecks and optimization opportunities.`)

  // Step 5: Generate incident report
  .step("incident-report")
  .with("anthropic:claude-3.5-sonnet")
  .depends("parse-logs", "root-cause", "detect-anomalies", "performance-analysis")
  .prompt(`Create an incident analysis report:

Errors: {{parse-logs.output}}
Root Causes: {{root-cause.output}}
Anomalies: {{detect-anomalies.output}}
Performance: {{performance-analysis.output}}

Structure:
# Log Analysis Report - {{timeRange}}

## 🚨 Critical Issues
- Errors requiring immediate attention

## 📊 Error Summary
- Top 10 errors by frequency
- Error rate trends

## 🔍 Root Cause Analysis
- Primary issues identified
- Correlation patterns

## ⚡ Performance Insights
- Slow operations
- Resource constraints

## 💡 Recommended Actions
- Prioritized fixes
- Monitoring improvements

Use specific log excerpts as evidence.`)

  .run({
    logEntries: logs,
    baselineStats: {
      avgErrorRate: 0.05,
      avgResponseTime: 250,
      typicalErrorTypes: ["validation_error", "not_found"],
    },
    timeRange: "Last 24 hours",
  });

// Send alerts for critical issues
const report = result.steps["incident-report"].output;
if (report.includes("CRITICAL")) {
  await sendPagerDuty({
    severity: "critical",
    title: "Critical errors detected in logs",
    details: report,
  });
}
```

## Real-time Log Monitoring

```typescript
import { relay } from "@relayplane/workflows";
import { createReadStream } from "fs";
import { createInterface } from "readline";

// Stream logs and analyze in batches.
// getBaseline() and alertOnCall() are app-specific helpers.
async function monitorLogs(logFile: string) {
  const BATCH_SIZE = 1000;
  const logBuffer: string[] = [];

  const analyzeBatch = async (lines: string[]) => {
    const analysis = await relay
      .workflow("log-analyzer")
      .run({
        logEntries: lines.join("\n"),
        baselineStats: await getBaseline(),
        timeRange: "Last hour",
      });

    // Alert if the anomaly step flagged anything critical
    const anomalies = analysis.steps["detect-anomalies"].output;
    if (anomalies.includes("CRITICAL")) {
      await alertOnCall(anomalies);
    }
  };

  const rl = createInterface({
    input: createReadStream(logFile),
    crlfDelay: Infinity,
  });

  for await (const line of rl) {
    logBuffer.push(line);

    if (logBuffer.length >= BATCH_SIZE) {
      await analyzeBatch(logBuffer);
      logBuffer.length = 0; // clear buffer
    }
  }

  // Flush any remaining lines that didn't fill a full batch
  if (logBuffer.length > 0) {
    await analyzeBatch(logBuffer);
  }
}

// Re-scan every minute. Note: this re-reads the file from the start on
// each run — in production, track a byte offset or use a tail-style
// reader so previously analyzed lines aren't re-processed.
setInterval(() => monitorLogs("/var/log/app.log"), 60_000);
```
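The `getBaseline()` helper above is left undefined. One hedged sketch derives the same fields the workflow's `baselineStats` input expects from a window of historical log lines — the `type=` and `rt=` token formats are assumptions about the log layout, not a real API:

```typescript
interface BaselineStats {
  avgErrorRate: number;
  avgResponseTime: number;
  typicalErrorTypes: string[];
}

// Derive baseline stats from historical log lines. Assumes lines may
// contain "ERROR", an optional "type=<name>" token, and an optional
// "rt=<ms>" response-time token — adjust to your actual format.
function computeBaseline(historicalLines: string[]): BaselineStats {
  let errors = 0;
  let rtSum = 0;
  let rtCount = 0;
  const typeCounts = new Map<string, number>();

  for (const line of historicalLines) {
    if (/\bERROR\b/.test(line)) {
      errors++;
      const type = line.match(/type=(\w+)/)?.[1];
      if (type) typeCounts.set(type, (typeCounts.get(type) ?? 0) + 1);
    }
    const rt = line.match(/rt=(\d+)/)?.[1];
    if (rt) {
      rtSum += Number(rt);
      rtCount++;
    }
  }

  return {
    avgErrorRate: historicalLines.length ? errors / historicalLines.length : 0,
    avgResponseTime: rtCount ? rtSum / rtCount : 0,
    typicalErrorTypes: [...typeCounts.entries()]
      .sort((a, b) => b[1] - a[1])
      .map(([t]) => t),
  };
}
```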

## Integration with Log Aggregators

```typescript
// DataDog integration (assumes a configured `datadogClient` instance,
// plus app-specific getHourlyBaseline() and postToSlack() helpers)
import { relay } from "@relayplane/workflows";

async function analyzeDDLogs() {
  const logs = await datadogClient.logs.list({
    filter: {
      query: "status:error",
      from: new Date(Date.now() - 3600000), // Last hour
      to: new Date(),
    },
    page: { limit: 5000 },
  });

  const analysis = await relay
    .workflow("log-analyzer")
    .run({
      logEntries: logs.data.map(l => JSON.stringify(l.attributes)).join("\n"),
      baselineStats: await getHourlyBaseline(),
    });

  // Post summary to Slack
  await postToSlack({
    channel: "#engineering-alerts",
    text: analysis.steps["incident-report"].output,
  });
}

// CloudWatch Logs
import { CloudWatchLogsClient, FilterLogEventsCommand } from "@aws-sdk/client-cloudwatch-logs";

async function analyzeCloudWatchLogs(logGroup: string) {
  const client = new CloudWatchLogsClient({});
  const command = new FilterLogEventsCommand({
    logGroupName: logGroup,
    startTime: Date.now() - 3600000,
    filterPattern: "ERROR",
  });

  const response = await client.send(command);
  const logs = response.events?.map(e => e.message).join("\n") || "";

  return await relay.workflow("log-analyzer").run({ logEntries: logs });
}
```
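Aggregator queries can return more log text than fits in a single prompt (the DataDog query above pulls up to 5,000 entries). A hedged sketch of splitting the payload into prompt-sized chunks — the ~4-characters-per-token heuristic and default budget are assumptions, and each chunk would be run through the workflow separately:

```typescript
// Split a large log payload into chunks that each fit an approximate
// token budget, breaking only on line boundaries so no entry is cut.
// Uses the rough heuristic of ~4 characters per token.
function chunkLogs(logText: string, maxTokens = 8000): string[] {
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  let current: string[] = [];
  let currentLen = 0;

  for (const line of logText.split("\n")) {
    if (currentLen + line.length + 1 > maxChars && current.length > 0) {
      chunks.push(current.join("\n"));
      current = [];
      currentLen = 0;
    }
    current.push(line);
    currentLen += line.length + 1; // +1 for the newline
  }
  if (current.length > 0) chunks.push(current.join("\n"));
  return chunks;
}
```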

## Sample Output

```markdown
# Log Analysis Report - Last 24 hours

## 🚨 Critical Issues

1. **Database Connection Pool Exhausted** (247 occurrences)
   - Started: 14:32 UTC
   - Cause: Connection leak in user service
   - Impact: 15% of requests failing
   - Action: Restart service, patch connection handling

2. **Authentication Service Down** (89 occurrences)
   - Started: 18:15 UTC (ongoing)
   - Cause: Downstream Auth0 timeout
   - Impact: All logins failing
   - Action: Enable fallback auth, contact Auth0

## 📊 Error Summary

- Total errors: 1,247 (+340% vs baseline)
- Error rate: 2.1% (baseline: 0.05%)
- New error types: 3
- Top errors:
  1. ConnectionPoolExhausted (247)
  2. AuthenticationTimeout (89)
  3. RateLimitExceeded (63)

## 🔍 Root Cause Analysis

- **Connection leaks**: User service not releasing DB connections after errors
- **Cascading failures**: Auth timeout causing retry storms
- **Missing circuit breaker**: No fallback for auth service failures

## ⚡ Performance Insights

- P95 response time: 3.2s (up from 450ms baseline)
- 12 endpoints with response time over 5s
- Database query time increased 4x
- 89% of requests waiting on connections
```

## Customization

- Add regex patterns for custom error formats
- Machine learning for anomaly detection thresholds
- Correlation with deployment events
- User impact estimation (affected user IDs)
- Auto-create Jira tickets for recurring issues
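The first customization — regex patterns for custom error formats — might look like this. The `timestamp [LEVEL] service: message` layout and all field names are illustrative assumptions for a hypothetical log format:

```typescript
// Parse a custom "timestamp [LEVEL] service: message" log format into
// structured entries before handing them to the workflow.
interface LogEntry {
  timestamp: string;
  level: string;
  service: string;
  message: string;
}

const CUSTOM_FORMAT = /^(\S+)\s+\[(\w+)\]\s+([\w-]+):\s+(.*)$/;

function parseCustomFormat(line: string): LogEntry | null {
  const m = line.match(CUSTOM_FORMAT);
  if (!m) return null; // line doesn't match the expected format
  const [, timestamp, level, service, message] = m;
  return { timestamp, level, service, message };
}
```

Pre-structuring entries this way lets the `parse-logs` step spend its attention on categorization rather than raw parsing.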