Working with incidents and learning lessons
Every company will face security incidents. The question isn't whether you'll have them — it's how you'll respond and what you'll learn. Companies that handle incidents well get stronger after each one. Companies that handle them poorly repeat the same mistakes.
This chapter covers the practical side of incident response: what actually happens during an incident, how to run blameless postmortems that produce real improvements, documenting incidents for future reference, and running tabletop exercises to practice before real incidents hit.
This chapter assumes you have an IRP in place (covered in Security policies and procedures). Here we focus on execution and learning, not the plan itself.
What really happens during an incident
The IRP gives you the framework. Here's what it actually feels like and how to navigate the chaos.
The first 30 minutes
The first half hour sets the tone. Most incidents either get contained quickly or spiral into multi-day nightmares, depending on what happens in those early minutes.
What goes wrong:
- Nobody takes ownership ("I thought you were handling it")
- Time spent figuring out who to call
- Evidence destroyed by well-meaning fixes
- Panic decisions without thinking through consequences
- Communication gaps (leadership finds out from Twitter)
What should happen:
Minutes 0–5: Detection and initial assessment
- Alert received or issue reported
- Quick triage: is this real, how bad?
- Incident Lead identified and takes ownership
- Decision: escalate or investigate quietly
Minutes 5–15: Mobilization
- Create incident channel (#incident-YYYY-MM-DD-brief-name)
- Alert response team members
- Start incident log (time, action, who, notes)
- Assess: is immediate containment needed?
Minutes 15–30: Initial response
- Execute containment if needed (disable account, isolate system)
- Preserve evidence before making changes
- Brief update to leadership (for High/Critical)
- Assign investigation tasks
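Starting the incident log should take seconds, not minutes. A minimal sketch of a log helper — the function names, entry fields, and markdown table format here are our own, not from any particular tool:

```python
from datetime import datetime, timezone

def log_entry(log: list, actor: str, action: str, notes: str = "") -> None:
    """Append a timestamped entry to the in-memory incident log."""
    ts = datetime.now(timezone.utc).strftime("%H:%M:%S")
    log.append({"time": ts, "actor": actor, "action": action, "notes": notes})

def render_log(log: list) -> str:
    """Render the log as a markdown table for pasting into the incident doc."""
    rows = ["| Time (UTC) | Who | Action | Notes |", "|---|---|---|---|"]
    rows += [f"| {e['time']} | {e['actor']} | {e['action']} | {e['notes']} |"
             for e in log]
    return "\n".join(rows)

log = []
log_entry(log, "alice", "Acknowledged alert", "paged at detection")
log_entry(log, "alice", "Created incident channel")
print(render_log(log))
```

Even a shared spreadsheet works; the point is that every action gets a time, an owner, and a note while memory is fresh.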
The Incident Lead role
Someone needs to own the incident. This is usually the Security Champion, but could be any senior technical person.
Incident Lead responsibilities:
| Responsibility | What it means |
|---|---|
| Coordination | Make sure everyone knows what they're doing |
| Decision-making | Make calls when there's no clear answer |
| Communication | Keep stakeholders informed |
| Documentation | Ensure the incident log is maintained |
| Time management | Set checkpoints, avoid rabbit holes |
| Escalation | Know when to call for help |
What the Incident Lead is NOT doing:
- Deep technical investigation (delegate this)
- Writing code to fix things (delegate this)
- Customer communication (delegate this)
- Everything at once (you coordinate, others execute)
Running the incident channel
Your incident Slack/Teams channel is mission control. Keep it focused.
Channel discipline:
Good channel behavior:
- Status updates with timestamps
- Clear task assignments: "@alice please check CloudTrail for the last 24h"
- Decisions documented: "Decision: Rotating all API keys. Reason: Can't confirm scope."
- Questions clearly marked: "QUESTION: Do we have backups from before March 1?"
Bad channel behavior:
- Speculation without evidence
- Side conversations about unrelated topics
- Multiple people doing the same task
- Updates without context
Periodic status updates:
Every 30–60 minutes, the Incident Lead posts a summary:
## Status Update — 14:30
**Current status:** Investigating
**Severity:** High
**Duration:** 2 hours
**What we know:**
- Unauthorized access to user database confirmed
- Access occurred via compromised admin credentials
- ~500 records potentially accessed
**What we're doing:**
- Alice: Analyzing access logs to determine full scope
- Bob: Rotating all admin credentials
- Carol: Preparing customer notification draft
**Open questions:**
- When did the compromise occur? (Reviewing logs back to Jan 1)
- Were credentials stolen or guessed?
**Next update:** 15:30 or sooner if significant development
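If you post these updates by hand, a small formatter keeps them consistent under pressure. A sketch, assuming you track the status fields in a plain dict — the field names here are illustrative:

```python
def format_status_update(status: dict) -> str:
    """Render a status update matching the template above from a dict of fields."""
    lines = [
        f"## Status Update — {status['time']}",
        f"**Current status:** {status['state']}",
        f"**Severity:** {status['severity']}",
        f"**Duration:** {status['duration']}",
        "**What we know:**",
        *[f"- {fact}" for fact in status["known"]],
        "**What we're doing:**",
        *[f"- {who}: {task}" for who, task in status["tasks"].items()],
        "**Open questions:**",
        *[f"- {q}" for q in status["questions"]],
        f"**Next update:** {status['next_update']}",
    ]
    return "\n".join(lines)

update = format_status_update({
    "time": "14:30", "state": "Investigating", "severity": "High",
    "duration": "2 hours",
    "known": ["Unauthorized access to user database confirmed"],
    "tasks": {"Alice": "Analyzing access logs"},
    "questions": ["When did the compromise occur?"],
    "next_update": "15:30 or sooner",
})
print(update)
```

Posting the result to your incident channel (by paste or webhook) is up to your tooling; the value is that no update silently drops a section.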
When to escalate
Not every incident needs the CEO at 2 AM. But some do.
| Escalate immediately | Can wait until morning | Handle internally |
|---|---|---|
| Active attacker in systems | Contained breach, small scope | Policy violation, no data impact |
| Customer data exfiltration confirmed | Suspicious activity under investigation | Single account compromise (non-admin) |
| Ransomware or destructive malware | Vulnerability discovered (not exploited) | Failed attack attempt |
| Public disclosure imminent | Third-party breach affecting us | Near-miss with no impact |
| Legal/regulatory implications | Malware on single endpoint (contained) | Security tool alerts (normal volume) |
How to escalate:
To: [CEO/CTO name]
Subject: Security Incident - [Brief description] - [Severity]
We have a [severity] security incident that requires your awareness.
**What happened:** [2-3 sentences]
**Current impact:** [What's affected right now]
**What we're doing:** [Actions in progress]
**What we need from you:** [Decision needed, if any]
I'll update you in [timeframe] or immediately if status changes.
[Your name] - [Phone number for callback]
Blameless postmortems
The postmortem is where learning happens. Do it wrong and people hide mistakes. Do it right and you build a culture of continuous improvement.
Why blameless matters
When people fear blame, they:
- Don't report issues ("maybe nobody will notice")
- Hide their involvement ("it wasn't me")
- Cover up details ("let's just fix it and move on")
- Avoid risk entirely ("I'm not touching that system")
When people feel safe, they:
- Report issues early ("I think I might have caused a problem")
- Share details openly ("here's exactly what happened")
- Propose improvements ("here's how we could prevent this")
- Take ownership ("I'll fix it and document the process")
Blameless doesn't mean unaccountable. It means we focus on systems, not individuals. The question isn't "who screwed up?" but "what allowed this to happen and how do we prevent it?"
When to run a postmortem
Not every incident needs a formal postmortem. Here's a guide:
| Incident type | Postmortem? | Format |
|---|---|---|
| Critical severity | Yes, mandatory | Full postmortem meeting + document |
| High severity | Yes | Full or abbreviated |
| Medium severity | Usually | Abbreviated or async |
| Low severity | Optional | Quick notes, no meeting |
| Near-miss with lessons | Yes | Abbreviated |
Also run postmortems when:
- The response itself had problems (even if the incident was minor)
- There are systemic lessons to learn
- Someone requests one
- It's a new type of incident you haven't seen before
The postmortem meeting
Timing: Within 1 week of incident resolution (memories fade)
Duration: 45-90 minutes
Attendees:
- Everyone involved in the response
- Relevant stakeholders (not the whole company)
- Facilitator (ideally not the Incident Lead — they need to participate)
Agenda:
## Postmortem Meeting Agenda
1. **Set the stage (5 min)**
- Reminder: This is blameless. Focus on systems, not people.
- Goal: Understand what happened and prevent recurrence.
2. **Timeline review (15-20 min)**
- Walk through the incident chronologically
- Fill in gaps in the timeline
- No judgment, just facts
3. **What went well (10 min)**
- What worked? What should we do more of?
- Recognition for good responses
4. **What could improve (15-20 min)**
- Where did we struggle?
- What slowed us down?
- What information was missing?
5. **Root cause analysis (15 min)**
- Why did this happen?
- Keep asking "why" until you reach systemic issues
- Usually 3-5 levels deep
6. **Action items (10-15 min)**
- What specific changes will prevent recurrence?
- Assign owners and due dates
- Be realistic about capacity
7. **Wrap-up (5 min)**
- Confirm action item owners
- Set follow-up date to check progress
- Thank everyone
The Five Whys technique
Keep asking "why" until you reach root causes you can actually fix.
Example:
Incident: Customer data was exposed via public S3 bucket
Why was the bucket public?
→ Developer set it to public during testing and forgot to change it.
Why did they set it to public?
→ They needed external access for a demo and didn't know another way.
Why didn't they know another way?
→ We don't have documented patterns for secure external sharing.
Why don't we have documented patterns?
→ Nobody has taken ownership of cloud security documentation.
Why hasn't anyone taken ownership?
→ Cloud security responsibilities aren't clearly defined.
Root causes:
1. No secure sharing documentation
2. Unclear cloud security ownership
3. No process to verify bucket permissions before production
Action items:
1. Document secure external sharing patterns
2. Assign cloud security ownership to DevOps team
3. Add bucket permission check to deployment pipeline
Notice how we moved from "developer made a mistake" to systemic issues we can actually fix.
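Action item 3 — verifying bucket permissions before production — can start as a simple gate in the deployment pipeline. A sketch of the core check only; in a real pipeline you would fetch the inputs with boto3's `get_bucket_acl` and `get_public_access_block` and fail the deploy when this returns True:

```python
# AWS's well-known "everyone" grantee group URIs, as they appear in ACL grants
PUBLIC_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def bucket_is_public(acl_grants: list, public_access_block: dict) -> bool:
    """True if the ACL grants access to an all-users group and the
    Block Public Access config isn't set to ignore public ACLs."""
    if public_access_block.get("IgnorePublicAcls", False):
        return False
    return any(g.get("Grantee", {}).get("URI") in PUBLIC_URIS
               for g in acl_grants)

grants = [{"Grantee": {"URI": "http://acs.amazonaws.com/groups/global/AllUsers"},
           "Permission": "READ"}]
print(bucket_is_public(grants, {}))  # → True
```

This deliberately checks only ACLs; a production gate should also evaluate bucket policies, which is where most real exposures come from.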
Facilitating without blame
The facilitator's job is to keep the discussion productive and safe.
Language to use:
| Instead of... | Say... |
|---|---|
| "Who made this change?" | "Let's look at what changes were made and when." |
| "Why didn't you catch this?" | "What would have helped us catch this earlier?" |
| "That was a mistake." | "This is where things started to go wrong. What contributed?" |
| "You should have known." | "What information would have been helpful here?" |
| "Whose fault is this?" | "What systems or processes could we improve?" |
Redirect blame when it appears:
Participant: "Bob should have known not to do that."
Facilitator: "Let's focus on the system. If someone could make this mistake, others might too. What would prevent anyone from making this error?"
The postmortem document
The document is the permanent record. It should be useful for anyone who reads it later.
Postmortem template:
# Postmortem: [Incident Title]
**Date of incident:** YYYY-MM-DD
**Date of postmortem:** YYYY-MM-DD
**Authors:** [Names]
**Status:** Draft / Final
**Severity:** Critical / High / Medium / Low
## Executive summary
[2-3 paragraphs that anyone in the company could understand. What happened,
what was the impact, what are we doing about it.]
## Impact
- **Duration:** [Time from detection to resolution]
- **Users affected:** [Number and type]
- **Data affected:** [What data, how much]
- **Financial impact:** [If applicable]
- **Reputation impact:** [If applicable]
## Timeline
All times in [timezone].
| Time | Event |
|------|-------|
| 09:15 | Alert triggered: unusual database query volume |
| 09:18 | On-call engineer acknowledged alert |
| 09:25 | Investigation started, Incident Lead assigned |
| ... | ... |
| 14:30 | Incident resolved, monitoring confirmed normal |
## Root cause analysis
### What happened
[Technical explanation of what went wrong. Be specific.]
### Why it happened
[The Five Whys analysis. Get to systemic causes.]
### Contributing factors
[Other things that made it worse or delayed response.]
## What went well
- [Specific thing that worked]
- [Specific thing that worked]
- [Recognition for individuals or teams]
## What could improve
- [Specific area for improvement]
- [Specific area for improvement]
## Action items
| Action | Owner | Due date | Status |
|--------|-------|----------|--------|
| Add bucket permission check to CI/CD | @alice | 2024-03-15 | In progress |
| Document secure sharing patterns | @bob | 2024-03-22 | Not started |
| Assign cloud security ownership | @cto | 2024-03-01 | Done |
## Lessons learned
[Key takeaways that others should know about. What would you tell your
past self before this incident?]
## Appendix
[Relevant logs, screenshots, or technical details that support the analysis.]
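Fields like Duration in the Impact section can be computed from the timeline rather than eyeballed. A rough sketch, assuming same-day HH:MM timestamps as in the example table:

```python
from datetime import datetime

def incident_duration(timeline: list) -> str:
    """Derive the Duration field from the first and last timeline entries.
    Assumes same-day HH:MM timestamps; multi-day incidents need full dates."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    end = datetime.strptime(timeline[-1][0], fmt)
    minutes = int((end - start).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60}m"

timeline = [("09:15", "Alert triggered"), ("14:30", "Incident resolved")]
print(incident_duration(timeline))  # → 5h 15m
```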
Following up on action items
Action items are useless if they never get done.
Follow-up process:
- Assign real owners — Not teams, specific people
- Set realistic due dates — Rushed fixes cause new incidents
- Track in your issue tracker — Not just the postmortem doc
- Check progress weekly — Security Champion reviews open items
- Close the loop — When done, update the postmortem doc
- Report completion — Share that actions were completed
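Tracking items in your issue tracker is best, but even a plain list can be checked mechanically each week. A sketch of an overdue check — the field names mirror the action item table above and are our own invention:

```python
from datetime import date

def overdue_items(items: list, today: date) -> list:
    """Return open action items past their due date, oldest first."""
    open_states = {"Not started", "In progress"}
    late = [i for i in items
            if i["status"] in open_states and i["due"] < today]
    return sorted(late, key=lambda i: i["due"])

items = [
    {"action": "Document secure sharing patterns", "owner": "@bob",
     "due": date(2024, 3, 22), "status": "Not started"},
    {"action": "Add bucket permission check", "owner": "@alice",
     "due": date(2024, 3, 15), "status": "In progress"},
    {"action": "Assign cloud security ownership", "owner": "@cto",
     "due": date(2024, 3, 1), "status": "Done"},
]
for i in overdue_items(items, today=date(2024, 4, 1)):
    print(f"OVERDUE: {i['action']} ({i['owner']}, due {i['due']})")
```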
Monthly action item review:
## Postmortem Action Item Review — March 2024
### Completed this month
- [Incident X] Add bucket permission check to CI/CD — Done 3/12
- [Incident Y] Update password policy — Done 3/8
### In progress
- [Incident X] Document secure sharing patterns — 75% complete, due 3/22
- [Incident Z] Implement rate limiting — On track for 3/30
### Overdue
- [Incident Y] Security training for new hires — Due 3/1, blocked on content
- New due date: 3/31
- Blocker: Waiting for training platform access
### Metrics
- Total open items: 8
- Completed this month: 5
- Overdue: 1
Building a security knowledge base
Incidents are expensive lessons. Don't waste them by forgetting what you learned.
What to document
| Document type | Purpose | Example |
|---|---|---|
| Postmortems | Detailed incident analysis | S3 bucket exposure — March 2024 |
| Runbooks | Step-by-step response procedures | How to respond to ransomware |
| Playbooks | Decision frameworks for scenarios | When to notify customers |
| Patterns | Secure ways to do common things | How to share files externally |
| Anti-patterns | Common mistakes to avoid | S3 bucket misconfigurations |
| Tool guides | How to use security tools | Using CloudTrail for investigation |
Organizing the knowledge base
security-knowledge-base/
├── README.md # Index and how to use
├── incidents/
│ ├── 2024-03-15-s3-exposure.md
│ ├── 2024-02-28-phishing-account-compromise.md
│ └── template.md
├── runbooks/
│ ├── compromised-account.md
│ ├── ransomware.md
│ ├── data-breach.md
│ └── leaked-secrets.md
├── playbooks/
│ ├── customer-notification-decision.md
│ ├── severity-classification.md
│ └── escalation-criteria.md
├── patterns/
│ ├── secure-file-sharing.md
│ ├── secrets-management.md
│ └── access-control-setup.md
└── tools/
├── cloudtrail-investigation.md
├── burp-suite-basics.md
└── log-analysis.md
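The tree above can be scaffolded in a few lines so nobody builds it by hand. A sketch — the stub contents and the `STRUCTURE` mapping are assumptions to adapt to your own layout:

```python
from pathlib import Path

STRUCTURE = {
    "incidents": ["template.md"],
    "runbooks": ["compromised-account.md", "ransomware.md",
                 "data-breach.md", "leaked-secrets.md"],
    "playbooks": ["customer-notification-decision.md",
                  "severity-classification.md", "escalation-criteria.md"],
    "patterns": ["secure-file-sharing.md", "secrets-management.md",
                 "access-control-setup.md"],
    "tools": ["cloudtrail-investigation.md", "log-analysis.md"],
}

def scaffold(root: str) -> None:
    """Create the knowledge-base tree with stub files; never overwrites."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"
    if not readme.exists():
        readme.write_text("# Security knowledge base\n\nIndex and how to use.\n")
    for folder, files in STRUCTURE.items():
        (base / folder).mkdir(exist_ok=True)
        for name in files:
            stub = base / folder / name
            if not stub.exists():
                title = name.removesuffix(".md").replace("-", " ")
                stub.write_text(f"# {title}\n\n_Last updated: TODO_\n")

scaffold("security-knowledge-base")
```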
Making knowledge findable
The best knowledge base is useless if people can't find things.
Make it searchable:
- Use consistent naming conventions
- Add tags or categories
- Include common search terms in documents
- Create an index page with links
Make it current:
- Review quarterly for outdated content
- Update after each incident
- Mark deprecated content clearly
- Include "last updated" dates
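The quarterly review is easier when stale docs flag themselves. A sketch that leans on those "last updated" dates — the `Last updated:` line format is an assumption about how your docs record it:

```python
import re
from datetime import date, timedelta

def stale_docs(docs: dict, today: date, max_age_days: int = 90) -> list:
    """Given {path: text}, return docs whose 'Last updated' date is older
    than max_age_days, or missing entirely."""
    pattern = re.compile(r"Last updated:\s*(\d{4}-\d{2}-\d{2})")
    cutoff = today - timedelta(days=max_age_days)
    flagged = []
    for path, text in docs.items():
        m = pattern.search(text)
        if not m or date.fromisoformat(m.group(1)) < cutoff:
            flagged.append(path)
    return flagged

docs = {
    "runbooks/ransomware.md": "_Last updated: 2023-11-05_",
    "runbooks/data-breach.md": "_Last updated: 2024-03-20_",
    "patterns/secure-file-sharing.md": "no date recorded",
}
print(stale_docs(docs, today=date(2024, 4, 1)))
# → ['runbooks/ransomware.md', 'patterns/secure-file-sharing.md']
```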
Make it accessible:
- Store where people already work (wiki, Git, Notion)
- Link from incident channels
- Include in onboarding
- Reference in training
Learning from others' incidents
You don't have to experience every incident yourself. Learn from public breaches.
Sources for breach analysis:
- Krebs on Security
- The Record by Recorded Future
- Bleeping Computer
- Vendor security blogs (Cloudflare, GitHub, etc.)
- Conference talks (DEF CON, BSides)
Monthly external incident review:
Pick one significant public breach each month. Analyze it as if it happened to you:
- What was the attack vector?
- Could this happen to us?
- What would we do differently?
- What can we implement proactively?
Tabletop exercises
Tabletop exercises are simulated incidents where you practice response without the pressure of a real attack. Think of them as fire drills for security.
Why tabletop exercises matter
- Test your plan — Find gaps before real incidents reveal them
- Build muscle memory — Practice makes response automatic
- Train the team — New people learn how you respond
- Identify confusion — Who does what? Now you'll know.
- Reduce stress — Familiar situations are less scary
Planning a tabletop exercise
Frequency: Quarterly at minimum, monthly if you can
Duration: 1-2 hours
Participants:
- Incident response team
- Leadership (at least occasionally)
- Anyone who would be involved in real incidents
Roles:
| Role | Responsibility |
|---|---|
| Facilitator | Runs the exercise, provides scenario updates |
| Observer | Takes notes on process, doesn't participate |
| Participants | Respond to the scenario as they would in reality |
Tabletop exercise format
1. Pre-exercise preparation (Facilitator)
- Write the scenario with realistic details
- Prepare 3-4 "injects" (new information revealed during exercise)
- Notify participants of time commitment
- Set up a dedicated channel/room
2. Exercise structure
Introduction (10 min): Explain rules ("respond as if this were real"), clarify this is practice not evaluation, read the initial scenario.
Response Phase 1 (20–30 min): Team discusses initial response. Facilitator asks probing questions. Inject 1: new information changes the situation.
Response Phase 2 (20–30 min): Team adjusts based on new information. Facilitator challenges assumptions. Inject 2: escalation or complication.
Resolution (15–20 min): Team works toward resolution. Inject 3: final twist (optional). Wrap up the scenario.
Debrief (20–30 min): What went well? What was confusing? What would we do differently? Action items for improvement.
Sample tabletop scenario: Data breach
Initial scenario:
## Scenario: Data Breach
**Date/Time:** Tuesday, 2:30 PM
**Situation:**
A security researcher has contacted your security@ email address. They claim
to have found a database backup file containing customer information publicly
accessible on one of your cloud storage buckets. They've provided a screenshot
showing customer names, email addresses, and hashed passwords. They say they
haven't shared this with anyone else yet, and are willing to work with you
before disclosing.
**What do you do?**
Inject 1 (after 20 min):
## New Information
You've verified the researcher's claim. The bucket was indeed public.
CloudTrail logs show it's been public for 3 weeks. The backup contains
12,000 customer records. You've made the bucket private.
Your CEO is asking for an update — they have a board call in 2 hours.
**Additional questions:**
- What do you tell the CEO?
- Do you need to notify customers?
- What's your regulatory timeline?
Inject 2 (after 40 min):
## New Information
A tech journalist has DM'd your company Twitter account asking for
comment on "the data breach." They say they're publishing a story
tomorrow morning.
Also, your legal counsel has reminded you that you have customers
in the EU (GDPR) and California (CCPA).
**Additional questions:**
- How do you handle the journalist?
- What's your customer communication plan?
- Who's drafting the public statement?
Inject 3 (after 60 min):
## Final Information
The backup was created by a contractor who left 6 months ago. They
were using a personal cloud account to work from home. You don't
have logs of what else they might have copied.
The story has been published. It's trending on Hacker News.
**Wrap-up questions:**
- What's your 24-hour action plan?
- How do you handle the contractor situation?
- What systemic changes would prevent this?
Other tabletop scenarios
Ransomware attack:
- 8 AM Monday, employees can't access files
- Ransom demand: $50,000 in Bitcoin within 48 hours
- Injects: Backups are from 2 weeks ago; attackers threaten to publish data
Compromised executive account:
- CFO's email is sending wire transfer requests
- Finance already processed one for $25,000
- Injects: CFO is on vacation with limited connectivity; attacker is responding to attempts to verify
Insider threat:
- Departing employee downloaded customer list before resignation
- They're joining a competitor
- Injects: Legal limitations on what you can do; the data is already on personal devices
Supply chain attack:
- Critical vendor announces they were breached
- They had access to your production environment
- Injects: Vendor is unresponsive; you find unauthorized API calls in your logs
Debrief questions
After the exercise, discuss:
Process:
- Did everyone know their role?
- Was communication clear?
- Did we follow our IRP?
- Where did we get stuck?

Decisions:
- Were decisions made quickly enough?
- Did we have the information we needed?
- Who had authority to decide what?

Communication:
- Would leadership be satisfied with our updates?
- Did we consider all stakeholders?
- Was external communication handled well?

Resources:
- Did we have the right people involved?
- What tools or information were missing?
- Do we need external help we don't have arranged?

Improvements:
- What do we need to change?
- What should we add to our IRP?
- What training is needed?
After the exercise
Immediate:
- Document findings while fresh
- Share summary with participants
- Thank everyone for their time
Within 1 week:
- Create action items from gaps identified
- Update IRP or runbooks if needed
- Schedule next exercise
Track over time:
- Are we getting faster at response?
- Are the same issues recurring?
- Is participation improving?
Common mistakes
- Treating postmortems as punishment — People stop reporting if they fear blame
- Not following up on action items — Lessons aren't learned if nothing changes
- Postmortem fatigue — Not every incident needs a full postmortem
- Skipping tabletops because "we're too busy" — Practice prevents expensive mistakes
- Documenting for compliance, not learning — Write for future you, not auditors
- Same people always involved — Rotate who leads exercises
- Unrealistic scenarios — Base tabletops on your actual risks
- No knowledge base — Reinventing the wheel each incident
- Leadership opt-out — Executives need to participate sometimes
- Perfecting the plan instead of practicing — A tested okay plan beats an untested perfect plan
Workshop: incident response practice
Part 1: Create incident documentation template (1 hour)
1. Customize the postmortem template:
- Adapt to your company's structure
- Add relevant sections for your industry
- Create fillable version in your wiki

2. Create an incident log template:
- Simple table format
- Pre-filled headers
- Instructions for use

3. Set up the knowledge base structure:
- Create folder structure
- Add README with instructions
- Link from main documentation
Part 2: Run a tabletop exercise (2 hours)
1. Prepare (30 min before):
- Choose a scenario relevant to your risks
- Prepare 2-3 injects
- Set up dedicated channel

2. Run the exercise (1 hour):
- Read initial scenario
- Let team respond
- Inject new information at intervals
- Take notes on process

3. Debrief (30-45 min):
- Discuss what worked
- Identify gaps
- Create action items
- Document findings
Artifacts to produce
After this workshop, you should have:
- Customized postmortem template
- Incident log template
- Knowledge base structure set up
- First tabletop exercise completed
- Action items from tabletop findings
- Schedule for future tabletop exercises
Self-check questions
- What's the difference between a postmortem and an incident log?
- Why is blameless culture important for security?
- What's the Five Whys technique and when do you use it?
- How often should you run tabletop exercises?
- What should you do if the same issues keep appearing in postmortems?
- Who should attend a postmortem meeting?
- What's the purpose of "injects" in a tabletop exercise?
- How do you handle blame during a postmortem discussion?
- Where should you store incident documentation?
- What's the difference between a runbook and a playbook?
How to explain this to leadership
The pitch: "We'll have security incidents. The question is whether we learn from them. I want to run regular practice exercises and document what we learn so we get better over time instead of repeating mistakes."
What you need:
- 2 hours quarterly for tabletop exercises
- Participation from key people (including leadership occasionally)
- A place to document incidents (wiki, Notion, etc.)
- Authority to follow up on action items
The ROI:
- Faster incident response (practice makes permanent)
- Fewer repeated incidents (we learn from mistakes)
- Lower incident costs (catch problems earlier)
- Better prepared team (less panic, better decisions)
- Compliance checkbox (many frameworks require incident response testing)
Metrics to track:
- Mean time to detect incidents
- Mean time to resolve incidents
- Number of action items completed vs. opened
- Tabletop exercise frequency
- Recurring issues (should decrease)
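The first two metrics fall out of three timestamps per incident. A sketch of the calculation — the `occurred`/`detected`/`resolved` field names are our own:

```python
from datetime import datetime
from statistics import mean

def response_metrics(incidents: list) -> dict:
    """Mean time to detect (occurred → detected) and mean time to
    resolve (detected → resolved), in hours."""
    def hours(a, b):
        return (b - a).total_seconds() / 3600
    return {
        "mttd_hours": round(mean(hours(i["occurred"], i["detected"])
                                 for i in incidents), 1),
        "mttr_hours": round(mean(hours(i["detected"], i["resolved"])
                                 for i in incidents), 1),
    }

dt = datetime.fromisoformat
incidents = [
    {"occurred": dt("2024-03-01T02:00"), "detected": dt("2024-03-01T09:15"),
     "resolved": dt("2024-03-01T14:15")},
    {"occurred": dt("2024-03-20T10:00"), "detected": dt("2024-03-20T10:45"),
     "resolved": dt("2024-03-20T16:45")},
]
print(response_metrics(incidents))  # → {'mttd_hours': 4.0, 'mttr_hours': 5.5}
```

For "occurred" use your best estimate from the logs; the trend over quarters matters more than any single number.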
Conclusion
Every incident is the most valuable input your security program will ever get — if you treat it that way. The blameless postmortem isn't a formality. It's how you turn a bad day into a better system.
The organizations that get better after incidents are the ones that run postmortems even when they're uncomfortable.
What's next
Next: measuring security program effectiveness — how to track whether any of this is actually working.