Working with incidents and learning lessons
Every company will face security incidents. The question isn't whether you'll have them — it's how you'll respond and what you'll learn. Companies that handle incidents well get stronger after each one. Companies that handle them poorly repeat the same mistakes.
This chapter covers the practical side of incident response: what actually happens during an incident, how to run blameless postmortems that produce real improvements, documenting incidents for future reference, and running tabletop exercises to practice before real incidents hit.
This chapter assumes you have an IRP in place (covered in Security policies and procedures). Here we focus on execution and learning, not the plan itself.
What really happens during an incident
The IRP gives you the framework. Here's what it actually feels like and how to navigate the chaos.
The first 30 minutes
The first half hour sets the tone. Most incidents either get contained quickly or spiral into multi-day nightmares, depending on what happens in those early minutes.
What goes wrong:
- Nobody takes ownership ("I thought you were handling it")
- Time spent figuring out who to call
- Evidence destroyed by well-meaning fixes
- Panic decisions without thinking through consequences
- Communication gaps (leadership finds out from Twitter)
What should happen:
Minutes 0–5: Detection and initial assessment
- Alert received or issue reported
- Quick triage: is this real, how bad?
- Incident Lead identified and takes ownership
- Decision: escalate or investigate quietly
Minutes 5–15: Mobilization
- Create incident channel (#incident-YYYY-MM-DD-brief-name)
- Alert response team members
- Start incident log (time, action, who, notes)
- Assess: is immediate containment needed?
Minutes 15–30: Initial response
- Execute containment if needed (disable account, isolate system)
- Preserve evidence before making changes
- Brief update to leadership (for High/Critical)
- Assign investigation tasks
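Starting the incident log should take seconds, not minutes. A minimal sketch of a log helper — the function names, entry fields, and markdown table format here are our own, not from any particular tool:

```python
from datetime import datetime, timezone

def log_entry(log: list, actor: str, action: str, notes: str = "") -> None:
    """Append a timestamped entry to the in-memory incident log."""
    ts = datetime.now(timezone.utc).strftime("%H:%M:%S")
    log.append({"time": ts, "actor": actor, "action": action, "notes": notes})

def render_log(log: list) -> str:
    """Render the log as a markdown table for pasting into the incident doc."""
    rows = ["| Time (UTC) | Who | Action | Notes |", "|---|---|---|---|"]
    rows += [f"| {e['time']} | {e['actor']} | {e['action']} | {e['notes']} |"
             for e in log]
    return "\n".join(rows)

log = []
log_entry(log, "alice", "Acknowledged alert", "paged at detection")
log_entry(log, "alice", "Created incident channel")
print(render_log(log))
```

Even a shared spreadsheet works; the point is that every action gets a time, an owner, and a note while memory is fresh.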
The Incident Lead role
Someone needs to own the incident. This is usually the Security Champion, but could be any senior technical person.
Incident Lead responsibilities:
| Responsibility | What it means |
|---|---|
| Coordination | Make sure everyone knows what they're doing |
| Decision-making | Make calls when there's no clear answer |
| Communication | Keep stakeholders informed |
| Documentation | Ensure the incident log is maintained |
| Time management | Set checkpoints, avoid rabbit holes |
| Escalation | Know when to call for help |
What the Incident Lead is NOT doing:
- Deep technical investigation (delegate this)
- Writing code to fix things (delegate this)
- Customer communication (delegate this)
- Everything at once (you coordinate, others execute)
Running the incident channel
Your incident Slack/Teams channel is mission control. Keep it focused.
Channel discipline:
Good channel behavior:
- Status updates with timestamps
- Clear task assignments: "@alice please check CloudTrail for the last 24h"
- Decisions documented: "Decision: Rotating all API keys. Reason: Can't confirm scope."
- Questions clearly marked: "QUESTION: Do we have backups from before March 1?"
Bad channel behavior:
- Speculation without evidence
- Side conversations about unrelated topics
- Multiple people doing the same task
- Updates without context
Periodic status updates:
Every 30–60 minutes, the Incident Lead posts a summary:
## Status Update — 14:30
**Current status:** Investigating
**Severity:** High
**Duration:** 2 hours
**What we know:**
- Unauthorized access to user database confirmed
- Access occurred via compromised admin credentials
- ~500 records potentially accessed
**What we're doing:**
- Alice: Analyzing access logs to determine full scope
- Bob: Rotating all admin credentials
- Carol: Preparing customer notification draft
**Open questions:**
- When did the compromise occur? (Reviewing logs back to Jan 1)
- Were credentials stolen or guessed?
**Next update:** 15:30 or sooner if significant development
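If you post these updates by hand, a small formatter keeps them consistent under pressure. A sketch, assuming you track the status fields in a plain dict — the field names here are illustrative:

```python
def format_status_update(status: dict) -> str:
    """Render a status update matching the template above from a dict of fields."""
    lines = [
        f"## Status Update — {status['time']}",
        f"**Current status:** {status['state']}",
        f"**Severity:** {status['severity']}",
        f"**Duration:** {status['duration']}",
        "**What we know:**",
        *[f"- {fact}" for fact in status["known"]],
        "**What we're doing:**",
        *[f"- {who}: {task}" for who, task in status["tasks"].items()],
        "**Open questions:**",
        *[f"- {q}" for q in status["questions"]],
        f"**Next update:** {status['next_update']}",
    ]
    return "\n".join(lines)

update = format_status_update({
    "time": "14:30", "state": "Investigating", "severity": "High",
    "duration": "2 hours",
    "known": ["Unauthorized access to user database confirmed"],
    "tasks": {"Alice": "Analyzing access logs"},
    "questions": ["When did the compromise occur?"],
    "next_update": "15:30 or sooner",
})
print(update)
```

Posting the result to your incident channel (by paste or webhook) is up to your tooling; the value is that no update silently drops a section.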
When to escalate
Not every incident needs the CEO at 2 AM. But some do.
| Escalate immediately | Can wait until morning | Handle internally |
|---|---|---|
| Active attacker in systems | Contained breach, small scope | Policy violation, no data impact |
| Customer data exfiltration confirmed | Suspicious activity under investigation | Single account compromise (non-admin) |
| Ransomware or destructive malware | Vulnerability discovered (not exploited) | Failed attack attempt |
| Public disclosure imminent | Third-party breach affecting us | Near-miss with no impact |
| Legal/regulatory implications | Malware on single endpoint (contained) | Security tool alerts (normal volume) |
How to escalate:
To: [CEO/CTO name]
Subject: Security Incident - [Brief description] - [Severity]
We have a [severity] security incident that requires your awareness.
**What happened:** [2-3 sentences]
**Current impact:** [What's affected right now]
**What we're doing:** [Actions in progress]
**What we need from you:** [Decision needed, if any]
I'll update you in [timeframe] or immediately if status changes.
[Your name] - [Phone number for callback]
Blameless postmortems
The postmortem is where learning happens. Do it wrong and people hide mistakes. Do it right and you build a culture of continuous improvement.
Why blameless matters
When people fear blame, they:
- Don't report issues ("maybe nobody will notice")
- Hide their involvement ("it wasn't me")
- Cover up details ("let's just fix it and move on")
- Avoid risk entirely ("I'm not touching that system")
When people feel safe, they:
- Report issues early ("I think I might have caused a problem")
- Share details openly ("here's exactly what happened")
- Propose improvements ("here's how we could prevent this")
- Take ownership ("I'll fix it and document the process")
Blameless doesn't mean unaccountable. It means we focus on systems, not individuals. The question isn't "who screwed up?" but "what allowed this to happen and how do we prevent it?"
When to run a postmortem
Not every incident needs a formal postmortem. Here's a guide:
| Incident type | Postmortem? | Format |
|---|---|---|
| Critical severity | Yes, mandatory | Full postmortem meeting + document |
| High severity | Yes | Full or abbreviated |
| Medium severity | Usually | Abbreviated or async |
| Low severity | Optional | Quick notes, no meeting |
| Near-miss with lessons | Yes | Abbreviated |
Also run postmortems when:
- The response itself had problems (even if the incident was minor)
- There are systemic lessons to learn
- Someone requests one
- It's a new type of incident you haven't seen before
The postmortem meeting
Timing: Within 1 week of incident resolution (memories fade)
Duration: 45-90 minutes
Attendees:
- Everyone involved in the response
- Relevant stakeholders (not the whole company)
- Facilitator (ideally not the Incident Lead — they need to participate)
Agenda:
## Postmortem Meeting Agenda
1. **Set the stage (5 min)**
- Reminder: This is blameless. Focus on systems, not people.
- Goal: Understand what happened and prevent recurrence.
2. **Timeline review (15-20 min)**
- Walk through the incident chronologically
- Fill in gaps in the timeline
- No judgment, just facts
3. **What went well (10 min)**
- What worked? What should we do more of?
- Recognition for good responses
4. **What could improve (15-20 min)**
- Where did we struggle?
- What slowed us down?
- What information was missing?
5. **Root cause analysis (15 min)**
- Why did this happen?
- Keep asking "why" until you reach systemic issues
- Usually 3-5 levels deep
6. **Action items (10-15 min)**
- What specific changes will prevent recurrence?
- Assign owners and due dates
- Be realistic about capacity
7. **Wrap-up (5 min)**
- Confirm action item owners
- Set follow-up date to check progress
- Thank everyone
The Five Whys technique
Keep asking "why" until you reach root causes you can actually fix.
Example:
Incident: Customer data was exposed via public S3 bucket
Why was the bucket public?
→ Developer set it to public during testing and forgot to change it.
Why did they set it to public?
→ They needed external access for a demo and didn't know another way.
Why didn't they know another way?
→ We don't have documented patterns for secure external sharing.
Why don't we have documented patterns?
→ Nobody has taken ownership of cloud security documentation.
Why hasn't anyone taken ownership?
→ Cloud security responsibilities aren't clearly defined.
Root causes:
1. No secure sharing documentation
2. Unclear cloud security ownership
3. No process to verify bucket permissions before production
Action items:
1. Document secure external sharing patterns
2. Assign cloud security ownership to DevOps team
3. Add bucket permission check to deployment pipeline
Notice how we moved from "developer made a mistake" to systemic issues we can actually fix.
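Action item 3 — verifying bucket permissions before production — can start as a simple gate in the deployment pipeline. A sketch of the core check only; in a real pipeline you would fetch the inputs with boto3's `get_bucket_acl` and `get_public_access_block` and fail the deploy when this returns True:

```python
# AWS's well-known "everyone" grantee group URIs, as they appear in ACL grants
PUBLIC_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def bucket_is_public(acl_grants: list, public_access_block: dict) -> bool:
    """True if the ACL grants access to an all-users group and the
    Block Public Access config isn't set to ignore public ACLs."""
    if public_access_block.get("IgnorePublicAcls", False):
        return False
    return any(g.get("Grantee", {}).get("URI") in PUBLIC_URIS
               for g in acl_grants)

grants = [{"Grantee": {"URI": "http://acs.amazonaws.com/groups/global/AllUsers"},
           "Permission": "READ"}]
print(bucket_is_public(grants, {}))  # → True
```

This deliberately checks only ACLs; a production gate should also evaluate bucket policies, which is where most real exposures come from.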
Facilitating without blame
The facilitator's job is to keep the discussion productive and safe.
Language to use:
| Instead of... | Say... |
|---|---|
| "Who made this change?" | "Let's look at what changes were made and when." |
| "Why didn't you catch this?" | "What would have helped us catch this earlier?" |
| "That was a mistake." | "This is where things started to go wrong. What contributed?" |
| "You should have known." | "What information would have been helpful here?" |
| "Whose fault is this?" | "What systems or processes could we improve?" |
Redirect blame when it appears:
Participant: "Bob should have known not to do that."
Facilitator: "Let's focus on the system. If someone could make this mistake, others might too. What would prevent anyone from making this error?"
The postmortem document
The document is the permanent record. It should be useful for anyone who reads it later.
Postmortem template:
# Postmortem: [Incident Title]
**Date of incident:** YYYY-MM-DD
**Date of postmortem:** YYYY-MM-DD
**Authors:** [Names]
**Status:** Draft / Final
**Severity:** Critical / High / Medium / Low
## Executive summary
[2-3 paragraphs that anyone in the company could understand. What happened,
what was the impact, what are we doing about it.]
## Impact
- **Duration:** [Time from detection to resolution]
- **Users affected:** [Number and type]
- **Data affected:** [What data, how much]
- **Financial impact:** [If applicable]
- **Reputation impact:** [If applicable]
## Timeline
All times in [timezone].
| Time | Event |
|------|-------|
| 09:15 | Alert triggered: unusual database query volume |
| 09:18 | On-call engineer acknowledged alert |
| 09:25 | Investigation started, Incident Lead assigned |
| ... | ... |
| 14:30 | Incident resolved, monitoring confirmed normal |
## Root cause analysis
### What happened
[Technical explanation of what went wrong. Be specific.]
### Why it happened
[The Five Whys analysis. Get to systemic causes.]
### Contributing factors
[Other things that made it worse or delayed response.]
## What went well
- [Specific thing that worked]
- [Specific thing that worked]
- [Recognition for individuals or teams]
## What could improve
- [Specific area for improvement]
- [Specific area for improvement]
## Action items
| Action | Owner | Due date | Status |
|--------|-------|----------|--------|
| Add bucket permission check to CI/CD | @alice | 2024-03-15 | In progress |
| Document secure sharing patterns | @bob | 2024-03-22 | Not started |
| Assign cloud security ownership | @cto | 2024-03-01 | Done |
## Lessons learned
[Key takeaways that others should know about. What would you tell your
past self before this incident?]
## Appendix
[Relevant logs, screenshots, or technical details that support the analysis.]
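Fields like Duration in the Impact section can be computed from the timeline rather than eyeballed. A rough sketch, assuming same-day HH:MM timestamps as in the example table:

```python
from datetime import datetime

def incident_duration(timeline: list) -> str:
    """Derive the Duration field from the first and last timeline entries.
    Assumes same-day HH:MM timestamps; multi-day incidents need full dates."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    end = datetime.strptime(timeline[-1][0], fmt)
    minutes = int((end - start).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60}m"

timeline = [("09:15", "Alert triggered"), ("14:30", "Incident resolved")]
print(incident_duration(timeline))  # → 5h 15m
```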
Following up on action items
Action items are useless if they never get done.
Follow-up process:
- Assign real owners — Not teams, specific people
- Set realistic due dates — Rushed fixes cause new incidents
- Track in your issue tracker — Not just the postmortem doc
- Check progress weekly — Security Champion reviews open items
- Close the loop — When done, update the postmortem doc
- Report completion — Share that actions were completed
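Tracking items in your issue tracker is best, but even a plain list can be checked mechanically each week. A sketch of an overdue check — the field names mirror the action item table above and are our own invention:

```python
from datetime import date

def overdue_items(items: list, today: date) -> list:
    """Return open action items past their due date, oldest first."""
    open_states = {"Not started", "In progress"}
    late = [i for i in items
            if i["status"] in open_states and i["due"] < today]
    return sorted(late, key=lambda i: i["due"])

items = [
    {"action": "Document secure sharing patterns", "owner": "@bob",
     "due": date(2024, 3, 22), "status": "Not started"},
    {"action": "Add bucket permission check", "owner": "@alice",
     "due": date(2024, 3, 15), "status": "In progress"},
    {"action": "Assign cloud security ownership", "owner": "@cto",
     "due": date(2024, 3, 1), "status": "Done"},
]
for i in overdue_items(items, today=date(2024, 4, 1)):
    print(f"OVERDUE: {i['action']} ({i['owner']}, due {i['due']})")
```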
Monthly action item review:
## Postmortem Action Item Review — March 2024
### Completed this month
- [Incident X] Add bucket permission check to CI/CD — Done 3/12
- [Incident Y] Update password policy — Done 3/8
### In progress
- [Incident X] Document secure sharing patterns — 75% complete, due 3/22
- [Incident Z] Implement rate limiting — On track for 3/30
### Overdue
- [Incident Y] Security training for new hires — Due 3/1, blocked on content
- New due date: 3/31
- Blocker: Waiting for training platform access
### Metrics
- Total open items: 8
- Completed this month: 5
- Overdue: 1
Building a security knowledge base
Incidents are expensive lessons. Don't waste them by forgetting what you learned.
What to document
| Document type | Purpose | Example |
|---|---|---|
| Postmortems | Detailed incident analysis | S3 bucket exposure — March 2024 |
| Runbooks | Step-by-step response procedures | How to respond to ransomware |
| Playbooks | Decision frameworks for scenarios | When to notify customers |
| Patterns | Secure ways to do common things | How to share files externally |
| Anti-patterns | Common mistakes to avoid | S3 bucket misconfigurations |
| Tool guides | How to use security tools | Using CloudTrail for investigation |
Organizing the knowledge base
security-knowledge-base/
├── README.md # Index and how to use
├── incidents/
│ ├── 2024-03-15-s3-exposure.md
│ ├── 2024-02-28-phishing-account-compromise.md
│ └── template.md
├── runbooks/
│ ├── compromised-account.md
│ ├── ransomware.md
│ ├── data-breach.md
│ └── leaked-secrets.md
├── playbooks/
│ ├── customer-notification-decision.md
│ ├── severity-classification.md
│ └── escalation-criteria.md
├── patterns/
│ ├── secure-file-sharing.md
│ ├── secrets-management.md
│ └── access-control-setup.md
└── tools/
├── cloudtrail-investigation.md
├── burp-suite-basics.md
└── log-analysis.md
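The tree above can be scaffolded in a few lines so nobody builds it by hand. A sketch — the stub contents and the `STRUCTURE` mapping are assumptions to adapt to your own layout:

```python
from pathlib import Path

STRUCTURE = {
    "incidents": ["template.md"],
    "runbooks": ["compromised-account.md", "ransomware.md",
                 "data-breach.md", "leaked-secrets.md"],
    "playbooks": ["customer-notification-decision.md",
                  "severity-classification.md", "escalation-criteria.md"],
    "patterns": ["secure-file-sharing.md", "secrets-management.md",
                 "access-control-setup.md"],
    "tools": ["cloudtrail-investigation.md", "log-analysis.md"],
}

def scaffold(root: str) -> None:
    """Create the knowledge-base tree with stub files; never overwrites."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"
    if not readme.exists():
        readme.write_text("# Security knowledge base\n\nIndex and how to use.\n")
    for folder, files in STRUCTURE.items():
        (base / folder).mkdir(exist_ok=True)
        for name in files:
            stub = base / folder / name
            if not stub.exists():
                title = name.removesuffix(".md").replace("-", " ")
                stub.write_text(f"# {title}\n\n_Last updated: TODO_\n")

scaffold("security-knowledge-base")
```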
Making knowledge findable
The best knowledge base is useless if people can't find things.
Make it searchable:
- Use consistent naming conventions
- Add tags or categories
- Include common search terms in documents
- Create an index page with links
Make it current:
- Review quarterly for outdated content
- Update after each incident
- Mark deprecated content clearly
- Include "last updated" dates
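The quarterly review is easier when stale docs flag themselves. A sketch that leans on those "last updated" dates — the `Last updated:` line format is an assumption about how your docs record it:

```python
import re
from datetime import date, timedelta

def stale_docs(docs: dict, today: date, max_age_days: int = 90) -> list:
    """Given {path: text}, return docs whose 'Last updated' date is older
    than max_age_days, or missing entirely."""
    pattern = re.compile(r"Last updated:\s*(\d{4}-\d{2}-\d{2})")
    cutoff = today - timedelta(days=max_age_days)
    flagged = []
    for path, text in docs.items():
        m = pattern.search(text)
        if not m or date.fromisoformat(m.group(1)) < cutoff:
            flagged.append(path)
    return flagged

docs = {
    "runbooks/ransomware.md": "_Last updated: 2023-11-05_",
    "runbooks/data-breach.md": "_Last updated: 2024-03-20_",
    "patterns/secure-file-sharing.md": "no date recorded",
}
print(stale_docs(docs, today=date(2024, 4, 1)))
# → ['runbooks/ransomware.md', 'patterns/secure-file-sharing.md']
```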
Make it accessible:
- Store where people already work (wiki, Git, Notion)
- Link from incident channels
- Include in onboarding
- Reference in training
Learning from others' incidents
You don't have to experience every incident yourself. Learn from public breaches.
Sources for breach analysis:
- Krebs on Security
- The Record by Recorded Future
- Bleeping Computer
- Vendor security blogs (Cloudflare, GitHub, etc.)
- Conference talks (DEF CON, BSides)
Monthly external incident review:
Pick one significant public breach each month. Analyze it as if it happened to you:
- What was the attack vector?
- Could this happen to us?
- What would we do differently?
- What can we implement proactively?
Tabletop exercises
Tabletop exercises are simulated incidents where you practice response without the pressure of a real attack. Think of them as fire drills for security.
Why tabletop exercises matter
- Test your plan — Find gaps before real incidents reveal them
- Build muscle memory — Practice makes response automatic
- Train the team — New people learn how you respond
- Identify confusion — Who does what? Now you'll know.
- Reduce stress — Familiar situations are less scary
Planning a tabletop exercise
Frequency: Quarterly at minimum, monthly if you can
Duration: 1-2 hours
Participants:
- Incident response team
- Leadership (at least occasionally)
- Anyone who would be involved in real incidents
Roles:
| Role | Responsibility |
|---|---|
| Facilitator | Runs the exercise, provides scenario updates |
| Observer | Takes notes on process, doesn't participate |
| Participants | Respond to the scenario as they would in reality |
Tabletop exercise format
1. Pre-exercise preparation (Facilitator)
- Write the scenario with realistic details
- Prepare 3-4 "injects" (new information revealed during exercise)
- Notify participants of time commitment
- Set up a dedicated channel/room
2. Exercise structure
Introduction (10 min): Explain rules ("respond as if this were real"), clarify this is practice not evaluation, read the initial scenario.
Response Phase 1 (20–30 min): Team discusses initial response. Facilitator asks probing questions. Inject 1: new information changes the situation.
Response Phase 2 (20–30 min): Team adjusts based on new information. Facilitator challenges assumptions. Inject 2: escalation or complication.
Resolution (15–20 min): Team works toward resolution. Inject 3: final twist (optional). Wrap up the scenario.
Debrief (20–30 min): What went well? What was confusing? What would we do differently? Action items for improvement.
Sample tabletop scenario: Data breach
Initial scenario:
## Scenario: Data Breach
**Date/Time:** Tuesday, 2:30 PM
**Situation:**
A security researcher has contacted your security@ email address. They claim
to have found a database backup file containing customer information publicly
accessible on one of your cloud storage buckets. They've provided a screenshot
showing customer names, email addresses, and hashed passwords. They say they
haven't shared this with anyone else yet, and are willing to work with you
before disclosing.
**What do you do?**
Inject 1 (after 20 min):
## New Information
You've verified the researcher's claim. The bucket was indeed public.
CloudTrail logs show it's been public for 3 weeks. The backup contains
12,000 customer records. You've made the bucket private.
Your CEO is asking for an update — they have a board call in 2 hours.
**Additional questions:**
- What do you tell the CEO?
- Do you need to notify customers?
- What's your regulatory timeline?
Inject 2 (after 40 min):
## New Information
A tech journalist has DM'd your company Twitter account asking for
comment on "the data breach." They say they're publishing a story
tomorrow morning.
Also, your legal counsel has reminded you that you have customers
in the EU (GDPR) and California (CCPA).
**Additional questions:**
- How do you handle the journalist?
- What's your customer communication plan?
- Who's drafting the public statement?
Inject 3 (after 60 min):
## Final Information
The backup was created by a contractor who left 6 months ago. They
were using a personal cloud account to work from home. You don't
have logs of what else they might have copied.
The story has been published. It's trending on Hacker News.
**Wrap-up questions:**
- What's your 24-hour action plan?
- How do you handle the contractor situation?
- What systemic changes would prevent this?
Other tabletop scenarios
Ransomware attack:
- 8 AM Monday, employees can't access files
- Ransom demand: $50,000 in Bitcoin within 48 hours
- Injects: Backups are from 2 weeks ago; attackers threaten to publish data
Compromised executive account:
- CFO's email is sending wire transfer requests
- Finance already processed one for $25,000
- Injects: CFO is on vacation with limited connectivity; attacker is responding to attempts to verify
Insider threat:
- Departing employee downloaded customer list before resignation
- They're joining a competitor
- Injects: Legal limitations on what you can do; the data is already on personal devices
Supply chain attack:
- Critical vendor announces they were breached
- They had access to your production environment
- Injects: Vendor is unresponsive; you find unauthorized API calls in your logs
Debrief questions
After the exercise, discuss:
Process:
- Did everyone know their role?
- Was communication clear?
- Did we follow our IRP?
- Where did we get stuck?

Decisions:
- Were decisions made quickly enough?
- Did we have the information we needed?
- Who had authority to decide what?

Communication:
- Would leadership be satisfied with our updates?
- Did we consider all stakeholders?
- Was external communication handled well?

Resources:
- Did we have the right people involved?
- What tools or information were missing?
- Do we need external help we don't have arranged?

Improvements:
- What do we need to change?
- What should we add to our IRP?
- What training is needed?
After the exercise
Immediate:
- Document findings while fresh
- Share summary with participants
- Thank everyone for their time
Within 1 week:
- Create action items from gaps identified
- Update IRP or runbooks if needed
- Schedule next exercise
Track over time:
- Are we getting faster at response?
- Are the same issues recurring?
- Is participation improving?
Common mistakes
- Treating postmortems as punishment — People stop reporting if they fear blame
- Not following up on action items — Lessons aren't learned if nothing changes
- Postmortem fatigue — Not every incident needs a full postmortem
- Skipping tabletops because "we're too busy" — Practice prevents expensive mistakes
- Documenting for compliance, not learning — Write for future you, not auditors
- Same people always involved — Rotate who leads exercises
- Unrealistic scenarios — Base tabletops on your actual risks
- No knowledge base — Reinventing the wheel each incident
- Leadership opt-out — Executives need to participate sometimes
- Perfecting the plan instead of practicing — A tested okay plan beats an untested perfect plan
Workshop: incident response practice
Part 1: Create incident documentation template (1 hour)
1. Customize the postmortem template:
- Adapt to your company's structure
- Add relevant sections for your industry
- Create fillable version in your wiki

2. Create an incident log template:
- Simple table format
- Pre-filled headers
- Instructions for use

3. Set up the knowledge base structure:
- Create folder structure
- Add README with instructions
- Link from main documentation
Part 2: Run a tabletop exercise (2 hours)
1. Prepare (30 min before):
- Choose a scenario relevant to your risks
- Prepare 2-3 injects
- Set up dedicated channel

2. Run the exercise (1 hour):
- Read initial scenario
- Let team respond
- Inject new information at intervals
- Take notes on process

3. Debrief (30-45 min):
- Discuss what worked
- Identify gaps
- Create action items
- Document findings
Artifacts to produce
After this workshop, you should have:
- Customized postmortem template
- Incident log template
- Knowledge base structure set up
- First tabletop exercise completed
- Action items from tabletop findings
- Schedule for future tabletop exercises
Self-check questions
- What's the difference between a postmortem and an incident log?
- Why is blameless culture important for security?
- What's the Five Whys technique and when do you use it?
- How often should you run tabletop exercises?
- What should you do if the same issues keep appearing in postmortems?
- Who should attend a postmortem meeting?
- What's the purpose of "injects" in a tabletop exercise?
- How do you handle blame during a postmortem discussion?
- Where should you store incident documentation?
- What's the difference between a runbook and a playbook?
How to explain this to leadership
The pitch: "We'll have security incidents. The question is whether we learn from them. I want to run regular practice exercises and document what we learn so we get better over time instead of repeating mistakes."
What you need:
- 2 hours quarterly for tabletop exercises
- Participation from key people (including leadership occasionally)
- A place to document incidents (wiki, Notion, etc.)
- Authority to follow up on action items
The ROI:
- Faster incident response (practice makes permanent)
- Fewer repeated incidents (we learn from mistakes)
- Lower incident costs (catch problems earlier)
- Better prepared team (less panic, better decisions)
- Compliance checkbox (many frameworks require incident response testing)
Metrics to track:
- Mean time to detect incidents
- Mean time to resolve incidents
- Number of action items completed vs. opened
- Tabletop exercise frequency
- Recurring issues (should decrease)
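The first two metrics fall out of three timestamps per incident. A sketch of the calculation — the `occurred`/`detected`/`resolved` field names are our own:

```python
from datetime import datetime
from statistics import mean

def response_metrics(incidents: list) -> dict:
    """Mean time to detect (occurred → detected) and mean time to
    resolve (detected → resolved), in hours."""
    def hours(a, b):
        return (b - a).total_seconds() / 3600
    return {
        "mttd_hours": round(mean(hours(i["occurred"], i["detected"])
                                 for i in incidents), 1),
        "mttr_hours": round(mean(hours(i["detected"], i["resolved"])
                                 for i in incidents), 1),
    }

dt = datetime.fromisoformat
incidents = [
    {"occurred": dt("2024-03-01T02:00"), "detected": dt("2024-03-01T09:15"),
     "resolved": dt("2024-03-01T14:15")},
    {"occurred": dt("2024-03-20T10:00"), "detected": dt("2024-03-20T10:45"),
     "resolved": dt("2024-03-20T16:45")},
]
print(response_metrics(incidents))  # → {'mttd_hours': 4.0, 'mttr_hours': 5.5}
```

For "occurred" use your best estimate from the logs; the trend over quarters matters more than any single number.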
Conclusion
Every incident is the most valuable input your security program will ever get — if you treat it that way. The blameless postmortem isn't a formality. It's how you turn a bad day into a better system.
The organizations that get better after incidents are the ones that run postmortems even when they're uncomfortable.
What's next
Next: measuring security program effectiveness — how to track whether any of this is actually working.