PII Detector & Redactor

Automatically detect and redact personally identifiable information (PII) from documents and logs.

This workflow identifies sensitive data such as names, email addresses, Social Security numbers (SSNs), and credit card numbers, then produces a redacted copy, a risk assessment, and anonymization recommendations.

Implementation

import { relay } from "@relayplane/workflows";

// `documentText`, `saveFile`, and `notifySecurityTeam` are supplied by your application.
const result = await relay
  .workflow("pii-detector")

  // Step 1: Detect PII entities
  .step("detect-pii")
  .with("openai:gpt-4o")
  .prompt(`Scan this content for PII (Personally Identifiable Information):

{{content}}

Identify all instances of:
- Full names
- Email addresses
- Phone numbers (all formats)
- Social Security Numbers (SSN)
- Credit card numbers
- Home addresses
- IP addresses
- Date of birth
- Government ID numbers (passport, driver's license)
- Medical record numbers
- Biometric data
- Account numbers

For each finding:
- PII type
- Exact text/value found
- Location (line number or context)
- Sensitivity level (high/medium/low)

Return as JSON array.`)

  // Step 2: Assess risk level
  .step("assess-risk")
  .with("anthropic:claude-3.5-sonnet")
  .depends("detect-pii")
  .prompt(`Assess data privacy risk:

PII Found: {{detect-pii.output}}

Content Type: {{contentType}}
Intended Use: {{intendedUse}}
Audience: {{audience}}

Evaluate:
- GDPR implications (if EU data)
- CCPA requirements (if California residents)
- HIPAA violations (if health data)
- Financial data regulations
- Overall privacy risk score (0-100)

Recommend classification: Public / Internal / Confidential / Restricted

Return as JSON with keys "score" (number, 0-100), "classification", and "regulations" (notes per regulation).`)

  // Step 3: Generate redacted version
  .step("redact-content")
  .with("anthropic:claude-3.5-sonnet")
  .depends("detect-pii")
  .prompt(`Create redacted version of content:

Original: {{content}}
PII Detected: {{detect-pii.output}}

Redaction strategy:
- Replace names with "[NAME]"
- Replace emails with "[EMAIL]"
- Replace SSN with "[SSN]"
- Replace credit cards with "[CREDIT_CARD]"
- Replace addresses with "[ADDRESS]"
- Replace any other detected PII with a bracketed type label, e.g. "[PHONE]"
- Preserve overall meaning and context

Return fully redacted content.`)

  // Step 4: Anonymization suggestions
  .step("anonymize-suggestions")
  .with("openai:gpt-4o")
  .depends("detect-pii", "assess-risk")
  .prompt(`Suggest anonymization strategies:

PII: {{detect-pii.output}}
Risk Assessment: {{assess-risk.output}}

For each PII type, recommend:
- Redaction (remove entirely)
- Masking (partial: j***@example.com)
- Tokenization (replace with unique ID)
- Generalization (specific → generic: "John" → "User A")
- Synthetic data (fake but realistic)

Consider the use case: {{intendedUse}}`)

  // Step 5: Generate report
  .step("pii-report")
  .with("anthropic:claude-3.5-sonnet")
  .depends("detect-pii", "assess-risk", "anonymize-suggestions")
  .prompt(`Create PII detection report:

Findings: {{detect-pii.output}}
Risk: {{assess-risk.output}}
Recommendations: {{anonymize-suggestions.output}}

Format:
# PII Detection Report

## 📊 Summary
- Total PII instances found
- PII types detected
- Risk level

## 🔍 Detailed Findings
- List each PII with context

## ⚠️ Compliance Risks
- GDPR/CCPA/HIPAA implications
- Required actions

## 💡 Recommendations
- How to handle each PII type
- Anonymization approach

## ✅ Next Steps
- Actionable items

Professional, clear, compliance-focused.`)

  .run({
    content: documentText,
    contentType: "Customer Support Transcript",
    intendedUse: "Training machine learning model",
    audience: "Internal data science team",
  });

// Save redacted version
await saveFile({
  path: "data/redacted/transcript-001.txt",
  content: result.steps["redact-content"].output,
});

// Alert if high-risk PII found.
// Relies on assess-risk returning the JSON object requested in its prompt;
// JSON.parse throws if the model returns anything else.
const risk = JSON.parse(result.steps["assess-risk"].output);
if (risk.score > 70) {
  await notifySecurityTeam({
    alert: "High-risk PII detected",
    report: result.steps["pii-report"].output,
  });
}
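
The detect-pii step asks for a JSON array but does not fix the key names, and models sometimes wrap JSON in extra text. The sketch below shows one way to parse that output defensively before acting on it. The `PiiFinding` shape and its keys (`type`, `value`, `location`, `sensitivity`) are assumptions for illustration, not part of the workflow's documented output.

```typescript
type Sensitivity = "high" | "medium" | "low";

// Hypothetical shape for one finding; the prompt does not guarantee these keys.
interface PiiFinding {
  type: string;
  value: string;
  location: string;
  sensitivity: Sensitivity;
}

function isSensitivity(x: unknown): x is Sensitivity {
  return x === "high" || x === "medium" || x === "low";
}

function parseFindings(raw: string): PiiFinding[] {
  // Keep only the outermost JSON array, ignoring any surrounding prose.
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start === -1 || end <= start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];

  const findings: PiiFinding[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) continue;
    const rec = item as Record<string, unknown>;
    const type = rec["type"];
    const value = rec["value"];
    // Skip entries that lack the two fields needed to act on a finding.
    if (typeof type !== "string" || typeof value !== "string") continue;
    const location = rec["location"];
    const sensitivity = rec["sensitivity"];
    findings.push({
      type,
      value,
      location: typeof location === "string" ? location : "",
      // Treat a missing or unknown level as "high" so the caller errs on the safe side.
      sensitivity: isSensitivity(sensitivity) ? sensitivity : "high",
    });
  }
  return findings;
}
```

A caller could pass `result.steps["detect-pii"].output` to `parseFindings` and treat an empty array as "nothing usable was returned" rather than "no PII present".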

Real-time Log Sanitization

import { relay } from "@relayplane/workflows";

// Sanitize logs before storing.
// Assumes each entry is a single line; `unsafeLogs` and `sendToDatadog`
// are supplied by your application.
async function sanitizeLogs(logEntries: string[]): Promise<string[]> {
  const batch = logEntries.join("\n");

  const result = await relay
    .workflow("pii-detector")
    .run({
      content: batch,
      contentType: "Application Logs",
      intendedUse: "Debugging and monitoring",
      audience: "Engineering team",
    });

  const redacted: string = result.steps["redact-content"].output;
  const lines = redacted.split("\n");

  // The model may merge or drop lines; fail closed rather than store misaligned logs.
  if (lines.length !== logEntries.length) {
    throw new Error("Redacted output does not match the input line count");
  }
  return lines;
}

// Use in logging pipeline
const sanitizedLogs = await sanitizeLogs(unsafeLogs);
await sendToDatadog(sanitizedLogs);
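
Joining every log entry into one batch can exceed a model's input limit on busy systems. The sketch below splits entries into size-bounded chunks that could each be passed to `sanitizeLogs` separately. The helper name `chunkLogEntries` and the character budget are assumptions for illustration; the right limit depends on the model and on how the prompt is tokenized.

```typescript
// Split log entries into chunks whose joined length stays within maxChars.
// An entry longer than maxChars is placed alone in its own chunk, not truncated.
function chunkLogEntries(entries: string[], maxChars: number): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let size = 0;

  for (const entry of entries) {
    const cost = entry.length + 1; // +1 for the newline added by join("\n")
    if (current.length > 0 && size + cost > maxChars) {
      chunks.push(current);
      current = [];
      size = 0;
    }
    current.push(entry);
    size += cost;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Because chunks preserve entry order, concatenating the sanitized chunks yields the sanitized log in its original sequence.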

Database Anonymization

// Anonymize production database for staging
import { relay } from "@relayplane/workflows";

type UserRecord = { id: string | number; [key: string]: unknown };

// `updateStagingDB` is supplied by your application.
async function anonymizeUserData(users: UserRecord[]) {
  // Records are processed one at a time, so each user costs two model calls.
  for (const user of users) {
    const userData = JSON.stringify(user);

    const result = await relay
      .workflow("pii-detector")
      .step("detect-pii")
      .with("openai:gpt-4o")
      .prompt(`List every PII field in this record as a JSON array:

{{userData}}`)
      .step("generate-synthetic")
      .with("anthropic:claude-3.5-sonnet")
      .depends("detect-pii")
      .prompt(`Generate synthetic replacement data:

Original: {{userData}}
PII: {{detect-pii.output}}

Generate realistic but fake:
- Names (maintain gender/ethnicity distribution)
- Emails (same domain patterns)
- Addresses (real cities, fake streets)
- Phone numbers (valid format, fake numbers)

Preserve relationships and patterns.
Return a JSON object with the same keys as the original.`)
      .run({ userData });

    // The output is the model's raw text; validate it before writing to the database.
    const syntheticUser = result.steps["generate-synthetic"].output;
    await updateStagingDB(user.id, syntheticUser);
  }
}
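
Generating synthetic values independently for each record will not, by itself, keep the same person or email address consistent across rows. A deterministic alternative for the "preserve relationships" requirement is tokenization with a lookup table. The sketch below is a minimal in-memory version; the `TokenVault` class and its token format are hypothetical, and a real deployment would need a persistent and access-controlled store, since the map itself holds the original values.

```typescript
// Maps each distinct (kind, value) pair to a stable token such as EMAIL_1.
class TokenVault {
  private readonly tokens = new Map<string, string>();
  private readonly counters = new Map<string, number>();

  tokenFor(kind: string, value: string): string {
    // Normalize so that trivial variations of the same value share one token.
    const key = `${kind}:${value.trim().toLowerCase()}`;
    const existing = this.tokens.get(key);
    if (existing !== undefined) return existing;

    const next = (this.counters.get(kind) ?? 0) + 1;
    this.counters.set(kind, next);
    const token = `${kind.toUpperCase()}_${next}`;
    this.tokens.set(key, token);
    return token;
  }
}

// Usage: the same email always maps to the same token within one vault.
const vault = new TokenVault();
const first = vault.tokenFor("email", "john.doe@email.com");
const again = vault.tokenFor("email", "John.Doe@email.com ");
const other = vault.tokenFor("email", "jane.roe@email.com");
```

Here `first` and `again` are both `EMAIL_1` and `other` is `EMAIL_2`, so joins and duplicate counts in staging data behave as they do in production.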

Sample Output

# PII Detection Report

## 📊 Summary
- **Total PII Found:** 37 instances
- **PII Types:** Email (12), Phone (8), SSN (2), Name (15)
- **Risk Level:** HIGH (Score: 85/100)
- **Classification:** RESTRICTED

## 🔍 Detailed Findings

**High Sensitivity:**
1. **SSN:** "123-45-6789" (Line 47)
   - Detected in customer intake form
   - Full 9-digit format
2. **SSN:** "987-65-4321" (Line 103)

**Medium Sensitivity:**
1. **Email:** "john.doe@email.com" (15 occurrences)
2. **Phone:** "(555) 123-4567" (8 occurrences)
3. **Name:** "John Doe" (15 occurrences)

## ⚠️ Compliance Risks
- **GDPR:** Presence of EU citizen data requires consent and data processing agreement
- **CCPA:** California residents identified - right to deletion applies
- **HIPAA:** Medical record numbers detected - requires encryption at rest

## 💡 Recommendations
1. **Immediate Actions:**
   - Remove all SSNs before using for ML training
   - Encrypt document at rest
   - Implement access logging
2. **Anonymization Strategy:**
   - SSN → Full redaction (not needed for ML model)
   - Email → Hash or tokenize (preserve uniqueness)
   - Phone → Partial mask: (555) ***-**67
   - Name → Replace with synthetic names

## ✅ Next Steps
1. Legal team approval before using data
2. Implement recommended redactions
3. Update data processing agreement
4. Enable audit logging for access
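
The partial masks recommended in the report, such as `(555) ***-**67` for phone numbers and the `j***@example.com` style for emails, can be applied deterministically once a value has been located. The sketch below shows one way to do that. The function names and the choice of how many characters to keep are assumptions for illustration.

```typescript
// Keep the first character of the local part and the full domain.
function maskEmail(email: string): string {
  const at = email.lastIndexOf("@");
  if (at <= 0) return "***"; // not a usable address; hide it entirely
  return `${email.charAt(0)}***${email.slice(at)}`;
}

// Keep the first `keepStart` and last `keepEnd` digits; preserve punctuation and spacing.
function maskPhone(phone: string, keepStart = 3, keepEnd = 2): string {
  const totalDigits = phone.replace(/\D/g, "").length;
  let seen = 0;
  return phone.replace(/\d/g, (digit) => {
    const index = seen++;
    const keep = index < keepStart || index >= totalDigits - keepEnd;
    return keep ? digit : "*";
  });
}

const maskedEmail = maskEmail("john.doe@email.com"); // "j***@email.com"
const maskedPhone = maskPhone("(555) 123-4567");     // "(555) ***-**67"
```

Masking keeps enough of the value for a human to recognize the record while hiding most of it. It does not preserve uniqueness, so tokenization or hashing is the better fit when values must remain joinable.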

Regex Patterns

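// Supplement AI with regex for common patterns.
// The patterns omit the `g` flag: a global regex keeps `lastIndex` between
// `.test()` calls, which makes repeated checks give inconsistent results.
const PII_PATTERNS = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/,
  // `\b` cannot match before "(", so only the dashed form starts with a word boundary.
  phone: /(?:\b\d{3}-|\(\d{3}\)\s*)\d{3}-\d{4}\b/,
  creditCard: /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/,
  ipAddress: /\b(?:\d{1,3}\.){3}\d{1,3}\b/,
};

function quickPIICheck(text: string): boolean {
  return Object.values(PII_PATTERNS).some(pattern => pattern.test(text));
}

// Pre-filter before AI analysis.
// These patterns do not cover names or addresses, so a negative result
// does not prove that the text is free of PII.
if (quickPIICheck(documentText)) {
  await relay.workflow("pii-detector").run({ content: documentText });
}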
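
The `creditCard` pattern matches any 16-digit group, including order numbers and other long identifiers. One common way to cut false positives is to confirm a candidate with the Luhn checksum, which payment card numbers satisfy. The sketch below is a standalone check; the function name `passesLuhn` is hypothetical and it is not part of the workflow library.

```typescript
// Returns true when the digits (ignoring spaces and dashes) satisfy the Luhn checksum.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d{13,19}$/.test(digits)) return false;

  let sum = 0;
  let doubleNext = false; // the rightmost digit is not doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9; // same as summing the two digits of the product
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

const looksLikeCard = passesLuhn("4111 1111 1111 1111"); // true: a well-known test number
const randomDigits = passesLuhn("1234-5678-9012-3456");  // false: checksum fails
```

A passing checksum does not prove that a number is a live card, only that it is well formed, so it works best as a filter on regex matches rather than as a detector by itself.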

Benefits

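Compliance: Support GDPR, CCPA, and HIPAA obligations by flagging regulated data before it is stored or shared
Risk Reduction: Lower the chance of leaks by redacting PII from documents and logs
Safe ML Training: Redact or anonymize production data before it reaches training pipelines
Audit Trail: Document PII handling with a generated report for each run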