Working with incidents and learning lessons

Every company will face security incidents. The question isn't whether you'll have them — it's how you'll respond and what you'll learn. Companies that handle incidents well get stronger after each one. Companies that handle them poorly repeat the same mistakes.

This chapter covers the practical side of incident response: what actually happens during an incident, how to run blameless postmortems that produce real improvements, documenting incidents for future reference, and running tabletop exercises to practice before real incidents hit.

This builds on the Incident Response Plan

This chapter assumes you have an IRP in place (covered in Security policies and procedures). Here we focus on execution and learning, not the plan itself.

What really happens during an incident

The IRP gives you the framework. Here's what it actually feels like and how to navigate the chaos.

The first 30 minutes

The first half hour sets the tone. Most incidents either get contained quickly or spiral into multi-day nightmares, depending on what happens early.

What goes wrong:

  • Nobody takes ownership ("I thought you were handling it")
  • Time spent figuring out who to call
  • Evidence destroyed by well-meaning fixes
  • Panic decisions without thinking through consequences
  • Communication gaps (leadership finds out from Twitter)

What should happen:

Minutes 0–5: Detection and initial assessment

  • Alert received or issue reported
  • Quick triage: is this real, how bad?
  • Incident Lead identified and takes ownership
  • Decision: escalate or investigate quietly

Minutes 5–15: Mobilization

  • Create incident channel (#incident-YYYY-MM-DD-brief-name)
  • Alert response team members
  • Start incident log (time, action, who, notes)
  • Assess: is immediate containment needed?

Minutes 15–30: Initial response

  • Execute containment if needed (disable account, isolate system)
  • Preserve evidence before making changes
  • Brief update to leadership (for High/Critical)
  • Assign investigation tasks
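
The incident log mentioned above doesn't need tooling — an append-only list of timestamped lines covering time, action, who, and notes is enough. A minimal sketch (the class and field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentLog:
    """Append-only incident log: one timestamped line per action."""
    entries: list = field(default_factory=list)

    def record(self, who: str, action: str, notes: str = "") -> str:
        # UTC timestamps avoid timezone confusion when reconstructing the timeline later.
        ts = datetime.now(timezone.utc).strftime("%H:%M:%S")
        line = f"{ts} | {who} | {action}" + (f" | {notes}" if notes else "")
        self.entries.append(line)
        return line

log = IncidentLog()
log.record("alice", "Acknowledged alert, starting triage")
log.record("bob", "Disabled compromised account", "credential rotation pending")
print("\n".join(log.entries))
```

Whatever form the log takes — a script, a pinned Slack message, a shared doc — the point is that every action lands in it the moment it happens, not reconstructed from memory afterward.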

The Incident Lead role

Someone needs to own the incident. This is usually the Security Champion, but could be any senior technical person.

Incident Lead responsibilities:

| Responsibility | What it means |
|----------------|---------------|
| Coordination | Make sure everyone knows what they're doing |
| Decision-making | Make calls when there's no clear answer |
| Communication | Keep stakeholders informed |
| Documentation | Ensure the incident log is maintained |
| Time management | Set checkpoints, avoid rabbit holes |
| Escalation | Know when to call for help |

What the Incident Lead is NOT doing:

  • Deep technical investigation (delegate this)
  • Writing code to fix things (delegate this)
  • Customer communication (delegate this)
  • Everything at once (you coordinate, others execute)

Running the incident channel

Your incident Slack/Teams channel is mission control. Keep it focused.

Channel discipline:

Good channel behavior:
- Status updates with timestamps
- Clear task assignments: "@alice please check CloudTrail for the last 24h"
- Decisions documented: "Decision: Rotating all API keys. Reason: Can't confirm scope."
- Questions clearly marked: "QUESTION: Do we have backups from before March 1?"

Bad channel behavior:
- Speculation without evidence
- Side conversations about unrelated topics
- Multiple people doing the same task
- Updates without context

Periodic status updates:

Every 30-60 minutes, the Incident Lead posts a summary:

## Status Update — 14:30

**Current status:** Investigating
**Severity:** High
**Duration:** 2 hours

**What we know:**
- Unauthorized access to user database confirmed
- Access occurred via compromised admin credentials
- ~500 records potentially accessed

**What we're doing:**
- Alice: Analyzing access logs to determine full scope
- Bob: Rotating all admin credentials
- Carol: Preparing customer notification draft

**Open questions:**
- When did the compromise occur? (Reviewing logs back to Jan 1)
- Were credentials stolen or guessed?

**Next update:** 15:30 or sooner if significant development

When to escalate

Not every incident needs the CEO at 2 AM. But some do.

| Escalate immediately | Can wait until morning | Handle internally |
|----------------------|------------------------|-------------------|
| Active attacker in systems | Contained breach, small scope | Policy violation, no data impact |
| Customer data exfiltration confirmed | Suspicious activity under investigation | Single account compromise (non-admin) |
| Ransomware or destructive malware | Vulnerability discovered (not exploited) | Failed attack attempt |
| Public disclosure imminent | Third-party breach affecting us | Near-miss with no impact |
| Legal/regulatory implications | Malware on single endpoint (contained) | Security tool alerts (normal volume) |

How to escalate:

To: [CEO/CTO name]
Subject: Security Incident - [Brief description] - [Severity]

We have a [severity] security incident that requires your awareness.

**What happened:** [2-3 sentences]
**Current impact:** [What's affected right now]
**What we're doing:** [Actions in progress]
**What we need from you:** [Decision needed, if any]

I'll update you in [timeframe] or immediately if status changes.

[Your name] - [Phone number for callback]

Blameless postmortems

The postmortem is where learning happens. Do it wrong and people hide mistakes. Do it right and you build a culture of continuous improvement.

Why blameless matters

When people fear blame, they:

  • Don't report issues ("maybe nobody will notice")
  • Hide their involvement ("it wasn't me")
  • Cover up details ("let's just fix it and move on")
  • Avoid risk entirely ("I'm not touching that system")

When people feel safe, they:

  • Report issues early ("I think I might have caused a problem")
  • Share details openly ("here's exactly what happened")
  • Propose improvements ("here's how we could prevent this")
  • Take ownership ("I'll fix it and document the process")

Blameless doesn't mean unaccountable. It means we focus on systems, not individuals. The question isn't "who screwed up?" but "what allowed this to happen and how do we prevent it?"

When to run a postmortem

Not every incident needs a formal postmortem. Here's a guide:

| Incident type | Postmortem? | Format |
|---------------|-------------|--------|
| Critical severity | Yes, mandatory | Full postmortem meeting + document |
| High severity | Yes | Full or abbreviated |
| Medium severity | Usually | Abbreviated or async |
| Low severity | Optional | Quick notes, no meeting |
| Near-miss with lessons | Yes | Abbreviated |

Also run postmortems when:

  • The response itself had problems (even if the incident was minor)
  • There are systemic lessons to learn
  • Someone requests one
  • It's a new type of incident you haven't seen before

The postmortem meeting

Timing: Within 1 week of incident resolution (memories fade)

Duration: 45-90 minutes

Attendees:

  • Everyone involved in the response
  • Relevant stakeholders (not the whole company)
  • Facilitator (ideally not the Incident Lead — they need to participate)

Agenda:

## Postmortem Meeting Agenda

1. **Set the stage (5 min)**
- Reminder: This is blameless. Focus on systems, not people.
- Goal: Understand what happened and prevent recurrence.

2. **Timeline review (15-20 min)**
- Walk through the incident chronologically
- Fill in gaps in the timeline
- No judgment, just facts

3. **What went well (10 min)**
- What worked? What should we do more of?
- Recognition for good responses

4. **What could improve (15-20 min)**
- Where did we struggle?
- What slowed us down?
- What information was missing?

5. **Root cause analysis (15 min)**
- Why did this happen?
- Keep asking "why" until you reach systemic issues
- Usually 3-5 levels deep

6. **Action items (10-15 min)**
- What specific changes will prevent recurrence?
- Assign owners and due dates
- Be realistic about capacity

7. **Wrap-up (5 min)**
- Confirm action item owners
- Set follow-up date to check progress
- Thank everyone

The Five Whys technique

Keep asking "why" until you reach root causes you can actually fix.

Example:

Incident: Customer data was exposed via public S3 bucket

Why was the bucket public?
→ Developer set it to public during testing and forgot to change it.

Why did they set it to public?
→ They needed external access for a demo and didn't know another way.

Why didn't they know another way?
→ We don't have documented patterns for secure external sharing.

Why don't we have documented patterns?
→ Nobody has taken ownership of cloud security documentation.

Why hasn't anyone taken ownership?
→ Cloud security responsibilities aren't clearly defined.

Root causes:
1. No secure sharing documentation
2. Unclear cloud security ownership
3. No process to verify bucket permissions before production

Action items:
1. Document secure external sharing patterns
2. Assign cloud security ownership to DevOps team
3. Add bucket permission check to deployment pipeline

Notice how we moved from "developer made a mistake" to systemic issues we can actually fix.
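
The third action item — verifying bucket permissions before production — reduces to a pure check once you have the bucket's public-access-block configuration (in a real pipeline you would fetch it with boto3's `get_public_access_block`; the function below is a minimal sketch of the gate itself):

```python
# The four flags that make up an S3 PublicAccessBlockConfiguration.
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)

def bucket_is_locked_down(config: dict) -> bool:
    """Return True only if every public-access-block flag is enabled.

    `config` is the PublicAccessBlockConfiguration dict, e.g. from
    boto3: s3.get_public_access_block(Bucket=...)["PublicAccessBlockConfiguration"].
    A missing flag counts as a failure, not a pass.
    """
    return all(config.get(flag) is True for flag in REQUIRED_FLAGS)

safe = {flag: True for flag in REQUIRED_FLAGS}
risky = {**safe, "BlockPublicPolicy": False}
print(bucket_is_locked_down(safe))   # True
print(bucket_is_locked_down(risky))  # False
```

A deployment pipeline would run this check for every bucket the release touches and fail the deploy on False — turning "nobody verified the bucket" from a human lapse into an impossible state.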

Facilitating without blame

The facilitator's job is to keep the discussion productive and safe.

Language to use:

| Instead of... | Say... |
|---------------|--------|
| "Who made this change?" | "Let's look at what changes were made and when." |
| "Why didn't you catch this?" | "What would have helped us catch this earlier?" |
| "That was a mistake." | "This is where things started to go wrong. What contributed?" |
| "You should have known." | "What information would have been helpful here?" |
| "Whose fault is this?" | "What systems or processes could we improve?" |

Redirect blame when it appears:

Participant: "Bob should have known not to do that."

Facilitator: "Let's focus on the system. If someone could make this mistake, others might too. What would prevent anyone from making this error?"

The postmortem document

The document is the permanent record. It should be useful for anyone who reads it later.

Postmortem template:

# Postmortem: [Incident Title]

**Date of incident:** YYYY-MM-DD
**Date of postmortem:** YYYY-MM-DD
**Authors:** [Names]
**Status:** Draft / Final
**Severity:** Critical / High / Medium / Low

## Executive summary

[2-3 paragraphs that anyone in the company could understand. What happened,
what was the impact, what are we doing about it.]

## Impact

- **Duration:** [Time from detection to resolution]
- **Users affected:** [Number and type]
- **Data affected:** [What data, how much]
- **Financial impact:** [If applicable]
- **Reputation impact:** [If applicable]

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| 09:15 | Alert triggered: unusual database query volume |
| 09:18 | On-call engineer acknowledged alert |
| 09:25 | Investigation started, Incident Lead assigned |
| ... | ... |
| 14:30 | Incident resolved, monitoring confirmed normal |

## Root cause analysis

### What happened

[Technical explanation of what went wrong. Be specific.]

### Why it happened

[The Five Whys analysis. Get to systemic causes.]

### Contributing factors

[Other things that made it worse or delayed response.]

## What went well

- [Specific thing that worked]
- [Specific thing that worked]
- [Recognition for individuals or teams]

## What could improve

- [Specific area for improvement]
- [Specific area for improvement]

## Action items

| Action | Owner | Due date | Status |
|--------|-------|----------|--------|
| Add bucket permission check to CI/CD | @alice | 2024-03-15 | In progress |
| Document secure sharing patterns | @bob | 2024-03-22 | Not started |
| Assign cloud security ownership | @cto | 2024-03-01 | Done |

## Lessons learned

[Key takeaways that others should know about. What would you tell your
past self before this incident?]

## Appendix

[Relevant logs, screenshots, or technical details that support the analysis.]

Following up on action items

Action items are useless if they never get done.

Follow-up process:

  1. Assign real owners — Not teams, specific people
  2. Set realistic due dates — Rushed fixes cause new incidents
  3. Track in your issue tracker — Not just the postmortem doc
  4. Check progress weekly — Security Champion reviews open items
  5. Close the loop — When done, update the postmortem doc
  6. Report completion — Share that actions were completed

Monthly action item review:

## Postmortem Action Item Review — March 2024

### Completed this month
- [Incident X] Add bucket permission check to CI/CD — Done 3/12
- [Incident Y] Update password policy — Done 3/8

### In progress
- [Incident X] Document secure sharing patterns — 75% complete, due 3/22
- [Incident Z] Implement rate limiting — On track for 3/30

### Overdue
- [Incident Y] Security training for new hires — Due 3/1, blocked on content
- New due date: 3/31
- Blocker: Waiting for training platform access

### Metrics
- Total open items: 8
- Completed this month: 5
- Overdue: 1
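
A review like the one above can be generated automatically once action items live in a tracker. A minimal sketch, assuming each exported item carries a status and a due date (the field names are illustrative — adapt them to your tracker's export format):

```python
from datetime import date

def review_action_items(items: list[dict], today: date) -> dict:
    """Summarize postmortem action items: open, done, and overdue counts.

    Each item is a dict with 'status' ('open' or 'done') and 'due' (a date).
    """
    open_items = [i for i in items if i["status"] == "open"]
    return {
        "open": len(open_items),
        "done": len([i for i in items if i["status"] == "done"]),
        # Overdue = still open and past its due date.
        "overdue": len([i for i in open_items if i["due"] < today]),
    }

items = [
    {"status": "done", "due": date(2024, 3, 12)},  # bucket check in CI/CD
    {"status": "open", "due": date(2024, 3, 22)},  # sharing patterns doc
    {"status": "open", "due": date(2024, 3, 1)},   # security training
]
print(review_action_items(items, today=date(2024, 3, 15)))
# {'open': 2, 'done': 1, 'overdue': 1}
```

Even a script this small keeps the monthly review honest: the overdue count comes from the tracker, not from whoever remembers to check.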

Building a security knowledge base

Incidents are expensive lessons. Don't waste them by forgetting what you learned.

What to document

| Document type | Purpose | Example |
|---------------|---------|---------|
| Postmortems | Detailed incident analysis | S3 bucket exposure — March 2024 |
| Runbooks | Step-by-step response procedures | How to respond to ransomware |
| Playbooks | Decision frameworks for scenarios | When to notify customers |
| Patterns | Secure ways to do common things | How to share files externally |
| Anti-patterns | Common mistakes to avoid | S3 bucket misconfigurations |
| Tool guides | How to use security tools | Using CloudTrail for investigation |

Organizing the knowledge base

security-knowledge-base/
├── README.md                  # Index and how to use
├── incidents/
│   ├── 2024-03-15-s3-exposure.md
│   ├── 2024-02-28-phishing-account-compromise.md
│   └── template.md
├── runbooks/
│   ├── compromised-account.md
│   ├── ransomware.md
│   ├── data-breach.md
│   └── leaked-secrets.md
├── playbooks/
│   ├── customer-notification-decision.md
│   ├── severity-classification.md
│   └── escalation-criteria.md
├── patterns/
│   ├── secure-file-sharing.md
│   ├── secrets-management.md
│   └── access-control-setup.md
└── tools/
    ├── cloudtrail-investigation.md
    ├── burp-suite-basics.md
    └── log-analysis.md
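
A small script can scaffold this layout so every team starts from the same structure. A sketch using only the standard library (the section names mirror the tree above; the demo writes to a throwaway directory):

```python
import tempfile
from pathlib import Path

SECTIONS = ["incidents", "runbooks", "playbooks", "patterns", "tools"]

def scaffold_knowledge_base(root: str) -> Path:
    """Create the knowledge-base skeleton with a README index stub."""
    base = Path(root)
    for section in SECTIONS:
        (base / section).mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"
    if not readme.exists():  # never clobber an existing index
        links = "\n".join(f"- [{s}/](./{s}/)" for s in SECTIONS)
        readme.write_text(f"# Security knowledge base\n\n{links}\n")
    return base

# Demo in a temporary directory; point `root` at your real repo or wiki export.
with tempfile.TemporaryDirectory() as tmp:
    base = scaffold_knowledge_base(f"{tmp}/security-knowledge-base")
    print(sorted(p.name for p in base.iterdir()))
    # ['README.md', 'incidents', 'patterns', 'playbooks', 'runbooks', 'tools']
```

Because `mkdir` is idempotent with `exist_ok=True`, the script is safe to rerun as the structure evolves.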

Making knowledge findable

The best knowledge base is useless if people can't find things.

Make it searchable:

  • Use consistent naming conventions
  • Add tags or categories
  • Include common search terms in documents
  • Create an index page with links

Make it current:

  • Review quarterly for outdated content
  • Update after each incident
  • Mark deprecated content clearly
  • Include "last updated" dates

Make it accessible:

  • Store where people already work (wiki, Git, Notion)
  • Link from incident channels
  • Include in onboarding
  • Reference in training
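
The quarterly review under "Make it current" is easier when stale pages flag themselves. A sketch that scans for a `Last updated: YYYY-MM-DD` line — the marker format is an assumption, adjust the regex to whatever convention you adopt — and reports anything older than a cutoff:

```python
import re
import tempfile
from datetime import date, timedelta
from pathlib import Path

MARKER = re.compile(r"Last updated:\s*(\d{4})-(\d{2})-(\d{2})")

def find_stale(root: str, today: date, max_age_days: int = 90) -> list[str]:
    """Return .md files whose 'Last updated' date is too old or missing."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for path in Path(root).rglob("*.md"):
        m = MARKER.search(path.read_text())
        # A missing marker counts as stale: unlabeled pages get reviewed first.
        if m is None or date(*map(int, m.groups())) < cutoff:
            stale.append(str(path))
    return sorted(stale)

# Demo against two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "fresh.md").write_text("Last updated: 2024-03-01")
    Path(tmp, "old.md").write_text("no marker at all")
    print(find_stale(tmp, today=date(2024, 3, 15)))  # lists only old.md
```

Run it from a scheduled job and post the output to your security channel; the review becomes reading a list, not hunting for it.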

Learning from others' incidents

You don't have to experience every incident yourself. Learn from public breaches.

Monthly external incident review:

Pick one significant public breach each month. Analyze it as if it happened to you:

  • What was the attack vector?
  • Could this happen to us?
  • What would we do differently?
  • What can we implement proactively?

Tabletop exercises

Tabletop exercises are simulated incidents where you practice response without the pressure of a real attack. Think of them as fire drills for security.

Why tabletop exercises matter

  • Test your plan — Find gaps before real incidents reveal them
  • Build muscle memory — Practice makes response automatic
  • Train the team — New people learn how you respond
  • Identify confusion — Who does what? Now you'll know.
  • Reduce stress — Familiar situations are less scary

Planning a tabletop exercise

Frequency: Quarterly at minimum, monthly if you can

Duration: 1-2 hours

Participants:

  • Incident response team
  • Leadership (at least occasionally)
  • Anyone who would be involved in real incidents

Roles:

| Role | Responsibility |
|------|----------------|
| Facilitator | Runs the exercise, provides scenario updates |
| Observer | Takes notes on process, doesn't participate |
| Participants | Respond to the scenario as they would in reality |

Tabletop exercise format

1. Pre-exercise preparation (Facilitator)

  • Write the scenario with realistic details
  • Prepare 3-4 "injects" (new information revealed during exercise)
  • Notify participants of time commitment
  • Set up a dedicated channel/room

2. Exercise structure

Introduction (10 min): Explain rules ("respond as if this were real"), clarify this is practice not evaluation, read the initial scenario.

Response Phase 1 (20–30 min): Team discusses initial response. Facilitator asks probing questions. Inject 1: new information changes the situation.

Response Phase 2 (20–30 min): Team adjusts based on new information. Facilitator challenges assumptions. Inject 2: escalation or complication.

Resolution (15–20 min): Team works toward resolution. Inject 3: final twist (optional). Wrap up the scenario.

Debrief (20–30 min): What went well? What was confusing? What would we do differently? Action items for improvement.

Sample tabletop scenario: Data breach

Initial scenario:

## Scenario: Data Breach

**Date/Time:** Tuesday, 2:30 PM

**Situation:**
A security researcher has contacted your security@ email address. They claim
to have found a database backup file containing customer information publicly
accessible on one of your cloud storage buckets. They've provided a screenshot
showing customer names, email addresses, and hashed passwords. They say they
haven't shared this with anyone else yet, and are willing to work with you
before disclosing.

**What do you do?**

Inject 1 (after 20 min):

## New Information

You've verified the researcher's claim. The bucket was indeed public.
CloudTrail logs show it's been public for 3 weeks. The backup contains
12,000 customer records. You've made the bucket private.

Your CEO is asking for an update — they have a board call in 2 hours.

**Additional questions:**
- What do you tell the CEO?
- Do you need to notify customers?
- What's your regulatory timeline?

Inject 2 (after 40 min):

## New Information

A tech journalist has DM'd your company Twitter account asking for
comment on "the data breach." They say they're publishing a story
tomorrow morning.

Also, your legal counsel has reminded you that you have customers
in the EU (GDPR) and California (CCPA).

**Additional questions:**
- How do you handle the journalist?
- What's your customer communication plan?
- Who's drafting the public statement?

Inject 3 (after 60 min):

## Final Information

The backup was created by a contractor who left 6 months ago. They
were using a personal cloud account to work from home. You don't
have logs of what else they might have copied.

The story has been published. It's trending on Hacker News.

**Wrap-up questions:**
- What's your 24-hour action plan?
- How do you handle the contractor situation?
- What systemic changes would prevent this?

Other tabletop scenarios

Ransomware attack:

  • 8 AM Monday, employees can't access files
  • Ransom demand: $50,000 in Bitcoin within 48 hours
  • Injects: Backups are from 2 weeks ago; attackers threaten to publish data

Compromised executive account:

  • CFO's email is sending wire transfer requests
  • Finance already processed one for $25,000
  • Injects: CFO is on vacation with limited connectivity; attacker is responding to attempts to verify

Insider threat:

  • Departing employee downloaded customer list before resignation
  • They're joining a competitor
  • Injects: Legal limitations on what you can do; the data is already on personal devices

Supply chain attack:

  • Critical vendor announces they were breached
  • They had access to your production environment
  • Injects: Vendor is unresponsive; you find unauthorized API calls in your logs

Debrief questions

After the exercise, discuss:

  1. Process:

    • Did everyone know their role?
    • Was communication clear?
    • Did we follow our IRP?
    • Where did we get stuck?
  2. Decisions:

    • Were decisions made quickly enough?
    • Did we have the information we needed?
    • Who had authority to decide what?
  3. Communication:

    • Would leadership be satisfied with our updates?
    • Did we consider all stakeholders?
    • Was external communication handled well?
  4. Resources:

    • Did we have the right people involved?
    • What tools or information were missing?
    • Do we need external help we don't have arranged?
  5. Improvements:

    • What do we need to change?
    • What should we add to our IRP?
    • What training is needed?

After the exercise

Immediate:

  • Document findings while fresh
  • Share summary with participants
  • Thank everyone for their time

Within 1 week:

  • Create action items from gaps identified
  • Update IRP or runbooks if needed
  • Schedule next exercise

Track over time:

  • Are we getting faster at response?
  • Are the same issues recurring?
  • Is participation improving?

Common mistakes

  1. Treating postmortems as punishment — People stop reporting if they fear blame
  2. Not following up on action items — Lessons aren't learned if nothing changes
  3. Postmortem fatigue — Not every incident needs a full postmortem
  4. Skipping tabletops because "we're too busy" — Practice prevents expensive mistakes
  5. Documenting for compliance, not learning — Write for future you, not auditors
  6. Same people always involved — Rotate who leads exercises
  7. Unrealistic scenarios — Base tabletops on your actual risks
  8. No knowledge base — Reinventing the wheel each incident
  9. Leadership opt-out — Executives need to participate sometimes
  10. Perfecting the plan instead of practicing — A tested okay plan beats an untested perfect plan

Workshop: incident response practice

Part 1: Create incident documentation template (1 hour)

  1. Customize the postmortem template:

    • Adapt to your company's structure
    • Add relevant sections for your industry
    • Create fillable version in your wiki
  2. Create incident log template:

    • Simple table format
    • Pre-filled headers
    • Instructions for use
  3. Set up knowledge base structure:

    • Create folder structure
    • Add README with instructions
    • Link from main documentation

Part 2: Run a tabletop exercise (2 hours)

  1. Prepare (30 min before):

    • Choose a scenario relevant to your risks
    • Prepare 2-3 injects
    • Set up dedicated channel
  2. Run the exercise (1 hour):

    • Read initial scenario
    • Let team respond
    • Inject new information at intervals
    • Take notes on process
  3. Debrief (30-45 min):

    • Discuss what worked
    • Identify gaps
    • Create action items
    • Document findings

Artifacts to produce

After this workshop, you should have:

  • Customized postmortem template
  • Incident log template
  • Knowledge base structure set up
  • First tabletop exercise completed
  • Action items from tabletop findings
  • Schedule for future tabletop exercises

Self-check questions

  1. What's the difference between a postmortem and an incident log?
  2. Why is blameless culture important for security?
  3. What's the Five Whys technique and when do you use it?
  4. How often should you run tabletop exercises?
  5. What should you do if the same issues keep appearing in postmortems?
  6. Who should attend a postmortem meeting?
  7. What's the purpose of "injects" in a tabletop exercise?
  8. How do you handle blame during a postmortem discussion?
  9. Where should you store incident documentation?
  10. What's the difference between a runbook and a playbook?

How to explain this to leadership

The pitch: "We'll have security incidents. The question is whether we learn from them. I want to run regular practice exercises and document what we learn so we get better over time instead of repeating mistakes."

What you need:

  • 2 hours quarterly for tabletop exercises
  • Participation from key people (including leadership occasionally)
  • A place to document incidents (wiki, Notion, etc.)
  • Authority to follow up on action items

The ROI:

  • Faster incident response (practice makes permanent)
  • Fewer repeated incidents (we learn from mistakes)
  • Lower incident costs (catch problems earlier)
  • Better prepared team (less panic, better decisions)
  • Compliance checkbox (many frameworks require incident response testing)

Metrics to track:

  • Mean time to detect incidents
  • Mean time to resolve incidents
  • Number of action items completed vs. opened
  • Tabletop exercise frequency
  • Recurring issues (should decrease)

Conclusion

Every incident is the most valuable input your security program will ever get — if you treat it that way. The blameless postmortem isn't a formality. It's how you turn a bad day into a better system.

The organizations that get better after incidents are the ones that run postmortems even when they're uncomfortable.

What's next

Next: measuring security program effectiveness — how to track whether any of this is actually working.